Estimation of the Direct-Path Relative Transfer
Function for Supervised Sound-Source Localization 

Xiaofei Li, Laurent Girin, Radu Horaud and Sharon Gannot 


Abstract —This paper addresses the problem of binaural
localization of a single speech source in noisy and reverberant
environments. For a given binaural microphone setup, the
binaural response corresponding to the direct-path propagation of
a single source is a function of the source direction. In practice,
this response is contaminated by noise and reverberations. The
direct-path relative transfer function (DP-RTF) is defined as the
ratio between the direct-path acoustic transfer functions of the
two channels. We propose a method to estimate the DP-RTF
from the noisy and reverberant microphone signals in the
short-time Fourier transform (STFT) domain. First, the convolutive
transfer function approximation is adopted to accurately represent
the impulse response of the sensors in the STFT domain. Second, the
DP-RTF is estimated by using the auto- and cross-power spectral
densities at each frequency and over multiple frames. In the
presence of stationary noise, an inter-frame spectral subtraction
algorithm is proposed, which yields estimates of the noise-free
auto- and cross-power spectral densities. Finally,
the estimated DP-RTFs are concatenated across frequencies and
used as a feature vector for the localization of a speech source.
Experiments with both simulated and real data show that the
proposed localization method performs well, even under severely
adverse acoustic conditions, and outperforms state-of-the-art
localization methods under most of the acoustic conditions.

Index Terms —binaural source localization, direct-path relative 
transfer function, inter-frame spectral subtraction. 


I. Introduction

Sound-source localization (SSL) is an important task for
many applications, e.g., robot audition, video conferencing, and
hearing aids, to cite just a few. In the framework of
human-inspired binaural hearing, two interaural cues are widely used
for SSL, namely the interaural phase difference (IPD) and the
interaural level difference (ILD) [1], [2], [3], [4], [5], [6], [7].
In the general case where the sensor array is not free-field,
i.e. the microphones are placed inside the ears of a dummy
head or on a robot head, the interaural cues are
frequency-dependent due to the effects on sound propagation induced by
the shape of the outer ears, head and torso [8]. This is true even
for anechoic recordings, i.e. in the absence of reverberations.
SSL is then based on the relationship between interaural cues
and direction of arrival (DOA) of the emitting source.
When the short-time Fourier transform (STFT) is used, the
ILD and IPD correspond to the magnitude and argument,
respectively, of the relative transfer function (RTF), which
is the ratio between the acoustic transfer functions (ATF) of
the two channels [9]. In a reverberant environment, the RTF
contains both direct-path information, namely the direct wave
propagation path from the source location to the microphone
locations, and information representing early and late
reverberations. Extracting the direct path is of crucial importance for
SSL. In an anechoic and noise-free environment the source
direction can be easily estimated from the RTF. However,
in practice, noise and reverberations are often present and
contaminate SSL estimation.

In the presence of noise, based on the stationarity of the
noise and the non-stationarity of the desired signal, the RTF
was estimated in [9] by solving a set of linear equations, and
in M by solving a set of nonlinear decorrelation equations.
In [10], the time difference of arrival (TDOA) was estimated
based on the RTF, and a TDOA tracking method was also
proposed. These methods have the limitation that a significant
number of noisy frames are included in the estimation. An
RTF identification method based on the probability of speech
presence and on spectral subtraction was proposed in [11]: this
method uses only the frames which are highly likely to contain
speech. The unbiased RTF estimator proposed in [12] is based
on segmental power spectral density matrix subtraction, which
is a more efficient method to remove noise compared with
the approaches just mentioned. The performance of these
spectral subtraction techniques was analyzed and compared
with eigenvalue decomposition techniques in [13].

The RTF estimators mentioned above assume a
multiplicative transfer function (MTF) approximation [14], i.e.,
the source-to-microphone filtering process is assumed to be
represented by a multiplicative process in the STFT domain.
Unfortunately, this is only justified when the length of the
filter impulse response is shorter than the length of the STFT
window, which is rarely the case in practice. Moreover, the
RTF is usually estimated from the ratio between two ATFs
that include reverberation, rather than from the ratio between
ATFs that only correspond to the direct-path sound
propagation. Therefore, currently available RTF estimators are poorly
suited for SSL in reverberant environments.

The influence of reverberation on the interaural cues is
analyzed in ca. The relative early transfer function was
introduced in ca to suppress reverberation. Several techniques
were proposed to extract the RTF that corresponds to the
direct-path sound propagation, e.g., based on detecting time
frames with less reverberation. The precedence effect, e.g.,
El, widely used for SSL, relies on the principle that signal
onsets are dominated by the direct path. Based on band-pass
filter banks, the localization cues are extracted only from
reliable frames, such as the onset frames in El, the frames
preceding a notable maximum El, the frames weighted by
the precedence model [20], etc. Interaural coherence was
proposed in [21] to select binaural cues not contaminated by
reverberations. Based on the Fourier transform, the coherence test
[22] and the direct-path dominance test [23] were proposed to
detect the frames dominated by one active source, from which
localization cues can be estimated. However, in practice, there
are always reflection components in the frames selected by
these methods, due to an inaccurate model or an improper
decision threshold.

Contributions and Method Overview: In this paper, we
propose a direct-path RTF estimator suitable for the
localization of a single speech source in noisy and reverberant
environments. We build on the cross-band filter proposed
in [24] for system identification in the STFT domain. This
filter represents the impulse response in the STFT domain
by a cross-band convolutive transfer function instead of the
multiplicative (MTF) approximation. In practice we consider
the use of a simplified convolutive transfer function (CTF)
approximation, as used in [25]. The first coefficient of the
CTF at different frequencies represents the STFT of the first
segment of the channel impulse response, which is composed
of the direct-path impulse response, plus possibly a few early
reflections. In particular, if the time delay between the
direct-path wave and the first notable reflection is large, fewer
reflections are included. Therefore, we refer to the first coefficient of
the CTF as the direct-path acoustic transfer function, and the
ratio between the coefficients from two channels is referred to
as the direct-path relative transfer function (DP-RTF).

Inspired by [26] and based on the relationship of the CTFs
between the two channels, we use the auto- and cross-power
spectral densities (PSD) estimated over multiple STFT frames
to construct a set of linear equations in which the DP-RTF is
the unknown variable. Therefore, the DP-RTF can be estimated
via standard least squares. In the presence of noise, an
inter-frame spectral subtraction technique is proposed, extending
our previous work [12]. The auto- and cross-PSD estimated
in a frame with low speech power are subtracted from the
PSDs estimated in a frame with high speech power. After
subtraction, low noise power and high speech power are left
due to the stationarity of the noise and the non-stationarity
of the speech signal. The DP-RTF is estimated using the
remaining signal's auto- and cross-PSD. This PSD subtraction
process does not require an explicit estimation of the noise
PSD, hence it does not suffer from noise PSD estimation
errors.

Finally, the estimated DP-RTFs are concatenated over
frequencies and plugged into an SSL method, e.g., [6].
Experiments with simulated and real data were conducted under
various acoustic conditions, e.g., different reverberation times,
source-to-sensor distances, and signal-to-noise ratios. The
experimental results show that the proposed method performs
well, even in adverse acoustic conditions, and outperforms
the MTF-based method [12], the coherence test method [22]
and the conventional SRP-PHAT method in most of the tested
conditions.


The remainder of this paper is organized as follows.
Section II formulates the sensor signals based on the cross-band
filter. Section III presents the DP-RTF estimator in a noise-free
environment. The DP-RTF estimator in the presence of noise
is presented in Section IV. In Section V, the SSL algorithm
is described. Experimental results are presented in Sections VI
and VII, and Section VIII draws some conclusions.


II. Cross-band Filter and Convolutive Transfer
Function


We consider first a non-stationary source signal $s(n)$, e.g.,
speech, emitted in a noise-free environment. The received
binaural signals are

$$x(n) = s(n) \star a(n), \qquad y(n) = s(n) \star b(n), \tag{1}$$

where $\star$ denotes convolution, and $a(n)$ and $b(n)$ are the
binaural room impulse responses (BRIR) from the source to
the two microphones. The BRIRs combine the effects of the
room acoustics (reverberations) and the effects of the sensor
set-up (e.g., dummy head/ears). Applying the STFT, (1) is
approximated in the time-frequency (TF) domain as

$$x_{p,k} = s_{p,k}\, a_k, \qquad y_{p,k} = s_{p,k}\, b_k, \tag{2}$$

where $x_{p,k}$, $y_{p,k}$ and $s_{p,k}$ are the STFT of the corresponding
signals ($p$ is the time frame index and $k$ is the frequency
bin index), and $a_k$ and $b_k$ are the ATFs corresponding to
the BRIRs. Let $N$ denote the length of a time frame or,
equivalently, the size of the STFT window. Eq. (2) corresponds
to the MTF approximation, which is only valid when the
impulse response $a(n)$ is shorter than the STFT window. In
the case of non-stationary acoustic signals, such as speech,
a relatively small value for $N$ is typically chosen to assume
local stationarity, i.e., within a frame. Therefore, the MTF
approximation (2) is questionable in a reverberant environment,
since the room impulse response could be much longer than
the STFT window.


To address this problem, cross-band filters were introduced in
[24] to represent more accurately a linear system with a long
impulse response in the STFT domain. Let $L$ denote the frame
step. The cross-band filter model consists in representing the
STFT coefficient $x_{p,k}$ in (2) as a summation over multiple
convolutions across frequency bins (there is an equivalent
expression for $y_{p,k}$):

$$x_{p,k} = \sum_{p'=-C}^{Q_k-1} \sum_{k'=0}^{N-1} s_{p-p',k'}\, a_{p',k',k}. \tag{3}$$

From [24], if $L < N$, then $a_{p',k',k}$ is non-causal, with
$C = \lceil N/L \rceil - 1$ non-causal coefficients. The number of causal filter
coefficients $Q_k$ is related to the reverberation time at the $k$-th
frequency bin, which will be discussed in detail in Section VI.
The TF-domain impulse response $a_{p',k',k}$ is related to the
time-domain impulse response $a(n)$ by:

$$a_{p',k',k} = \big(a(n) \star \zeta_{k,k'}(n)\big)\big|_{n=p'L}, \tag{4}$$

which represents the convolution with respect to the time index
$n$ evaluated at frame steps, with

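$$\zeta_{k,k'}(n) = e^{j\frac{2\pi}{N}k'n} \sum_{m=-\infty}^{+\infty} \tilde{\omega}(m)\, \omega(n+m)\, e^{-j\frac{2\pi}{N}m(k-k')}, \tag{5}$$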

where $\tilde{\omega}(n)$ and $\omega(n)$ denote the STFT analysis and synthesis
windows, respectively. A convolutive transfer function (CTF)
approximation is further introduced and used in [25] to
simplify the analysis, i.e., only band-to-band filters are considered,
$k = k'$. Hence, (3) is rewritten as

$$x_{p,k} = \sum_{p'=0}^{Q_k-1} s_{p-p',k}\, a_{p',k} = s_{p,k} \star a_{p,k}, \tag{6}$$

where we assumed $L \approx N$ such that non-causal coefficients
are disregarded. Note that $a_{p',k,k}$ is replaced with $a_{p',k}$ to
simplify the notations. The cross-band filter and CTF
formalism will now be used to extract the impulse response of the
direct-path propagation.
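
The band-to-band convolution in (6) can be illustrated with a short sketch. The code below is only a minimal illustration for a single frequency bin, assuming zero source signal before the first frame; the function names `ctf_filter` and `mtf_filter` are hypothetical and are not taken from the paper.

```python
def ctf_filter(s_k, a_k):
    """Apply a band-to-band CTF at one frequency bin, as in Eq. (6).

    s_k: source STFT coefficients s_{p,k} over frames p (list of complex).
    a_k: CTF coefficients a_{p',k} for p' = 0..Q_k-1 (list of complex).
    Returns x_{p,k} = sum_{p'} s_{p-p',k} * a_{p',k}, taking s_{p,k} = 0 for p < 0
    (an assumption made here for the illustration).
    """
    x_k = []
    for p in range(len(s_k)):
        acc = 0j
        for q, a in enumerate(a_k):
            if p - q >= 0:
                acc += s_k[p - q] * a
        x_k.append(acc)
    return x_k


def mtf_filter(s_k, a_scalar):
    """MTF special case of Eq. (2): a single multiplicative coefficient per bin."""
    return [s * a_scalar for s in s_k]


# Toy example: three frames, a CTF with Q_k = 2 coefficients.
s_k = [1 + 0j, 2j, -1 + 1j]
a_k = [0.5 + 0j, 0.25j]
x_k = ctf_filter(s_k, a_k)
```

With a single CTF coefficient the two functions coincide, which is the sense in which the MTF model is a special case of the CTF model.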

III. Direct-Path Relative Transfer Function 

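From (4) and (5), with $k' = k$ and $p' = 0$, the first
coefficient of $a_{p',k}$ in the CTF approximation (6) can be
derived as

$$a_{0,k} = \big(a(n) \star \zeta_{k,k}(n)\big)\big|_{n=0} = \sum_{t=0}^{T-1} a(t)\, \zeta_{k,k}(-t) = \sum_{t=0}^{N-1} a(t)\, \nu(t)\, e^{-j\frac{2\pi}{N}kt}, \tag{7}$$

where $T$ is the length of the BRIR and

$$\nu(n) = \begin{cases} \sum_{m} \tilde{\omega}(m)\, \omega(m-n) & \text{if } 1-N \le n \le N-1, \\ 0, & \text{otherwise.} \end{cases}$$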

Therefore, $a_{0,k}$ (as well as $b_{0,k}$) can be interpreted as the $k$-th
Fourier coefficient of the impulse response segment $a(n)|_{n=0}^{N-1}$
windowed by $\nu(n)|_{n=0}^{N-1}$. Without loss of generality, we assume
that the room impulse responses $a(n)$ and $b(n)$ begin with
the impulse responses of the direct-path propagation. If the
frame length $N$ is properly chosen, $a(n)|_{n=0}^{N-1}$ and $b(n)|_{n=0}^{N-1}$
are composed of the impulse responses of the direct path and
a few reflections. Particularly, if the initial time delay gap
(ITDG), i.e. the time delay between the direct-path wave and
the first notable reflection, is large compared to $N$, $a(n)|_{n=0}^{N-1}$
and $b(n)|_{n=0}^{N-1}$ mainly contain the direct-path impulse response.
Therefore we refer to $a_{0,k}$ and $b_{0,k}$ as the direct-path ATFs.
By definition, the DP-RTF is given by (recall that the
direct path is what matters for sound-source localization):

$$d_k = \frac{b_{0,k}}{a_{0,k}}. \tag{8}$$


In summary, the CTF approximation offers a nice framework 
to encode the direct-path part of a room impulse response into 
the first CTF coefficients. Applying this to each channel of a 
BRIR and taking the ratio between the first CTF coefficients 
of each channel provides the DP-RTF. Of course, in practice, 
the DP-RTF must be estimated from the sensor signals. 
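
As a small illustration of the DP-RTF definition and of its link to the interaural cues (magnitude for the ILD, argument for the IPD), the sketch below computes the ratio of two direct-path ATFs at one frequency bin. The function names are hypothetical, and expressing the ILD in decibels is an assumption made here for readability rather than something the text prescribes.

```python
import cmath
import math


def dp_rtf(a0_k, b0_k):
    """DP-RTF at one frequency: ratio of the first CTF coefficients b_{0,k} / a_{0,k}."""
    return b0_k / a0_k


def interaural_cues(d_k):
    """Return (ILD in dB, IPD in radians) from a complex DP-RTF value.

    The dB scaling of the level difference is an illustrative assumption.
    """
    ild_db = 20.0 * math.log10(abs(d_k))
    ipd_rad = cmath.phase(d_k)
    return ild_db, ipd_rad


# Toy values: the second channel is half as strong and leads by a quarter cycle.
d = dp_rtf(2 + 0j, 1j)
ild, ipd = interaural_cues(d)
```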


A. Direct-Path Estimation 


Since both channels are assumed to follow the CTF model,
we can write:

$$x_{p,k} \star b_{p,k} = s_{p,k} \star a_{p,k} \star b_{p,k} = y_{p,k} \star a_{p,k}. \tag{9}$$

This relation was proposed in [26], [27] for the time-domain
TDOA estimation and is here extended to the CTF domain. In
vector form, (9) can be written as

$$\mathbf{x}_{p,k}^{\top} \mathbf{b}_k = \mathbf{y}_{p,k}^{\top} \mathbf{a}_k, \tag{10}$$

where $^{\top}$ denotes vector or matrix transpose, and

$$\mathbf{x}_{p,k} = [x_{p,k}, x_{p-1,k}, \dots, x_{p-Q_k+1,k}]^{\top},$$
$$\mathbf{y}_{p,k} = [y_{p,k}, y_{p-1,k}, \dots, y_{p-Q_k+1,k}]^{\top},$$
$$\mathbf{b}_k = [b_{0,k}, b_{1,k}, \dots, b_{Q_k-1,k}]^{\top},$$
$$\mathbf{a}_k = [a_{0,k}, a_{1,k}, \dots, a_{Q_k-1,k}]^{\top}.$$

Dividing both sides of (10) by $a_{0,k}$ and reorganizing the terms,
we can write:

$$y_{p,k} = \mathbf{z}_{p,k}^{\top} \mathbf{g}_k, \tag{11}$$

where

$$\mathbf{z}_{p,k} = [x_{p,k}, \dots, x_{p-Q_k+1,k}, y_{p-1,k}, \dots, y_{p-Q_k+1,k}]^{\top},$$
$$\mathbf{g}_k = \left[\frac{b_{0,k}}{a_{0,k}}, \dots, \frac{b_{Q_k-1,k}}{a_{0,k}}, -\frac{a_{1,k}}{a_{0,k}}, \dots, -\frac{a_{Q_k-1,k}}{a_{0,k}}\right]^{\top}.$$

We see that the DP-RTF appears as the first entry of $\mathbf{g}_k$. Hence,
in the following, we base the estimation of the DP-RTF on
the construction of $y_{p,k}$ and $\mathbf{z}_{p,k}$ statistics. More specifically,
multiplying both sides of (11) by $y^{*}_{p,k}$ (the complex conjugate
of $y_{p,k}$) and taking the expectation, $E\{\cdot\}$, we obtain:


$$\phi_{yy}(p,k) = \boldsymbol{\phi}_{zy}^{\top}(p,k)\, \mathbf{g}_k, \tag{12}$$

where $\phi_{yy}(p,k) = E\{y_{p,k}\, y^{*}_{p,k}\}$ is the PSD of $y(n)$ at TF bin
$(p,k)$, and

$$\boldsymbol{\phi}_{zy}(p,k) = [E\{x_{p,k}\, y^{*}_{p,k}\}, \dots, E\{x_{p-Q_k+1,k}\, y^{*}_{p,k}\}, E\{y_{p-1,k}\, y^{*}_{p,k}\}, \dots, E\{y_{p-Q_k+1,k}\, y^{*}_{p,k}\}]^{\top}$$

is a vector composed of cross-PSD terms between the elements
of $\mathbf{z}_{p,k}$ and $y_{p,k}$.^1 In practice, these auto- and cross-PSD terms
can be estimated by averaging the corresponding auto- and
cross-STFT spectra over $D$ frames:

$$\hat{\phi}_{yy}(p,k) = \frac{1}{D} \sum_{d=0}^{D-1} y_{p-d,k}\, y^{*}_{p-d,k}. \tag{13}$$
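
The frame averaging used for the auto- and cross-PSD estimates can be sketched as follows. This is a minimal illustration for one frequency bin with 0-based frame indices; the helper name `cross_psd` is hypothetical.

```python
def cross_psd(u_k, y_k, p, D):
    """Averaged cross-spectrum (1/D) * sum_{d=0}^{D-1} u[p-d] * conj(y[p-d]).

    u_k, y_k: STFT coefficients of two signals at one frequency bin, indexed by
    frame (0-based). Passing the same sequence twice gives the auto-PSD estimate.
    """
    if p - D + 1 < 0:
        raise ValueError("frame p needs D - 1 preceding frames")
    return sum(u_k[p - d] * y_k[p - d].conjugate() for d in range(D)) / D


# Toy sequences at one frequency bin.
y = [1 + 1j, 2 + 0j, 3j]
x = [1j, 1 + 0j, 2 + 0j]
phi_yy = cross_psd(y, y, p=2, D=3)   # auto-PSD of y at frame 2
phi_xy = cross_psd(x, y, p=2, D=2)   # cross-PSD between x and y at frame 2
```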


^1 More precisely, $\boldsymbol{\phi}_{zy}(p,k)$ is composed of $y$ PSD 'cross-terms', i.e., $y$
taken at frame $p$ and previous frames, and of $x$, $y$ cross-PSD terms for $y$
taken at frame $p$ and $x$ taken at previous frames.









The elements in $\boldsymbol{\phi}_{zy}(p,k)$ can be estimated by using the same
principle. Consequently, in practice (12) is approximated as

$$\hat{\phi}_{yy}(p,k) = \hat{\boldsymbol{\phi}}_{zy}^{\top}(p,k)\, \mathbf{g}_k. \tag{14}$$

Let $P$ denote the total number of STFT frames. $Q_k$ is
the minimum index of $p$ that guarantees that the elements in
$\mathbf{z}_{p,k}$ are available from the STFT coefficients of the binaural
signals. For PSD estimation, the previous $D - 1$ frames of
the current frame are utilized, as shown in (13). Therefore,
$p_f = Q_k + D - 1$ is the minimum index of $p$ that guarantees that
all the frames for computing $\hat{\boldsymbol{\phi}}_{zy}(p,k)$ are available from the
STFT coefficients of the binaural signals. By concatenating
the frames from $p_f$ to $P$, (14) can be written in matrix-vector
form:

$$\hat{\boldsymbol{\phi}}_{yy}(k) = \hat{\boldsymbol{\Phi}}_{zy}(k)\, \mathbf{g}_k, \tag{15}$$

with

$$\hat{\boldsymbol{\phi}}_{yy}(k) = [\hat{\phi}_{yy}(p_f,k), \dots, \hat{\phi}_{yy}(p,k), \dots, \hat{\phi}_{yy}(P,k)]^{\top},$$
$$\hat{\boldsymbol{\Phi}}_{zy}(k) = [\hat{\boldsymbol{\phi}}_{zy}(p_f,k), \dots, \hat{\boldsymbol{\phi}}_{zy}(p,k), \dots, \hat{\boldsymbol{\phi}}_{zy}(P,k)]^{\top}.$$

Note that $\hat{\boldsymbol{\phi}}_{yy}(k)$ is a $(P - p_f + 1) \times 1$ vector and $\hat{\boldsymbol{\Phi}}_{zy}(k)$ is a
$(P - p_f + 1) \times (2Q_k - 1)$ matrix. In principle, an estimate $\hat{\mathbf{g}}_k$
of $\mathbf{g}_k$ can be found by solving this linear equation. However,
in practice, the sensor signals contain noise and thus the
estimated PSDs contain noise power. Therefore, we have to
remove this noise power before estimating $\mathbf{g}_k$.
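
To make the noise-free procedure concrete, the following sketch builds the averaged PSD equations at one frequency bin and solves them by least squares, then reads the DP-RTF from the first entry of the solution. It is an illustration under stated assumptions, not the authors' implementation: the signals are synthetic and follow the CTF model exactly, frame indices are 0-based (so the first usable frame is `Q + D - 2` rather than `Q_k + D - 1`), and the small normal-equation solver and all function names are hypothetical.

```python
import random


def ctf(s, h):
    """Causal band-to-band convolution at one frequency bin (zero initial state)."""
    return [sum(h[q] * s[p - q] for q in range(len(h)) if p - q >= 0)
            for p in range(len(s))]


def solve_least_squares(A, b):
    """Solve min ||A g - b|| through the normal equations A^H A g = A^H b."""
    n = len(A[0])
    # Augmented system [A^H A | A^H b].
    M = [[sum(row[i].conjugate() * row[j] for row in A) for j in range(n)]
         + [sum(row[i].conjugate() * bi for row, bi in zip(A, b))]
         for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate_dp_rtf(x, y, Q, D):
    """Noise-free DP-RTF estimate at one bin: first entry of the LS solution g."""
    n_frames = len(x)
    # z[t]: current and past x, then past y, at frame t (defined for t >= Q - 1).
    z = {t: [x[t - q] for q in range(Q)] + [y[t - q] for q in range(1, Q)]
         for t in range(Q - 1, n_frames)}
    rows, rhs = [], []
    for p in range(Q + D - 2, n_frames):
        frames = range(p - D + 1, p + 1)          # D frames ending at p
        rhs.append(sum(y[t] * y[t].conjugate() for t in frames) / D)
        rows.append([sum(z[t][i] * y[t].conjugate() for t in frames) / D
                     for i in range(2 * Q - 1)])
    return solve_least_squares(rows, rhs)[0]


# Synthetic check with made-up CTF coefficients (Q = 2).
random.seed(0)
s = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
a = [1.0 + 0.0j, 0.5 + 0.0j]
b = [0.6 + 0.3j, -0.4j]
x, y = ctf(s, a), ctf(s, b)
d_hat = estimate_dp_rtf(x, y, Q=2, D=8)   # should be close to b[0] / a[0]
```

Because the synthetic data satisfy the CTF relation exactly, the estimate matches `b[0] / a[0]` up to rounding; with real signals the model is only approximate and the PSD estimates are noisy.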


IV. DP-RTF Estimation in the Presence of Noise 

Noise always exists in real-world configurations. In the
presence of noise, some frames in (15) are dominated by noise.
Besides, the PSD estimate of speech signals is deteriorated by
noise. In this section, an inter-frame subtraction technique that
improves the DP-RTF estimation in noise is described,
based on a speech frame selection process.


A. Noisy Signals and PSD Estimates 

In the presence of additive noise, (1) becomes

$$\tilde{x}(n) = x(n) + u(n) = a(n) \star s(n) + u(n), \qquad \tilde{y}(n) = y(n) + v(n) = b(n) \star s(n) + v(n), \tag{16}$$

where $u(n)$ and $v(n)$, the noise signals, are assumed to be
individually wide-sense stationary (WSS) and uncorrelated
with $s(n)$. Moreover, $u(n)$ and $v(n)$ are assumed to be either
uncorrelated, or correlated but jointly WSS. Applying the
STFT to the binaural signals in (16) leads to

$$\tilde{x}_{p,k} = x_{p,k} + u_{p,k}, \qquad \tilde{y}_{p,k} = y_{p,k} + v_{p,k},$$

in which each quantity is the STFT coefficient of its
corresponding time-domain signal. Similarly to $\mathbf{z}_{p,k}$, we define

$$\tilde{\mathbf{z}}_{p,k} = [\tilde{x}_{p,k}, \dots, \tilde{x}_{p-Q_k+1,k}, \tilde{y}_{p-1,k}, \dots, \tilde{y}_{p-Q_k+1,k}]^{\top} = \mathbf{z}_{p,k} + \mathbf{w}_{p,k},$$

where

$$\mathbf{w}_{p,k} = [u_{p,k}, \dots, u_{p-Q_k+1,k}, v_{p-1,k}, \dots, v_{p-Q_k+1,k}]^{\top}.$$

The PSD of $\tilde{y}_{p,k}$ is $\phi_{\tilde{y}\tilde{y}}(p,k)$. We define the PSD vector
$\boldsymbol{\phi}_{\tilde{z}\tilde{y}}(p,k)$ composed of the auto- and cross-PSDs between the
elements of $\tilde{\mathbf{z}}_{p,k}$ and $\tilde{y}_{p,k}$. Following (13), these PSDs can be
estimated as $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$ and $\hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p,k)$ by averaging the auto-
and cross-STFT spectra of input signals over $D$ frames. Since
the speech and noise signals are uncorrelated, we can write

$$\hat{\phi}_{\tilde{y}\tilde{y}}(p,k) = \hat{\phi}_{yy}(p,k) + \hat{\phi}_{vv}(p,k), \qquad \hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p,k) = \hat{\boldsymbol{\phi}}_{zy}(p,k) + \hat{\boldsymbol{\phi}}_{wv}(p,k), \tag{17}$$

where $\hat{\phi}_{vv}(p,k)$ is an estimation of the PSD of $v_{p,k}$, and
$\hat{\boldsymbol{\phi}}_{wv}(p,k)$ is a vector composed of the estimated auto- or cross-
PSDs between the entries of $\mathbf{w}_{p,k}$ and $v_{p,k}$.


B. Inter-Frame Spectral Subtraction 

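From (14) and (17), we have for any frame $p$:

$$\hat{\phi}_{\tilde{y}\tilde{y}}(p,k) - \hat{\phi}_{vv}(p,k) = \big(\hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p,k) - \hat{\boldsymbol{\phi}}_{wv}(p,k)\big)^{\top} \mathbf{g}_k, \tag{18}$$

or alternately:

$$\hat{\phi}_{\tilde{y}\tilde{y}}(p,k) = \hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p,k)^{\top} \mathbf{g}_k + \hat{\phi}_{vv}(p,k) - \hat{\boldsymbol{\phi}}_{wv}(p,k)^{\top} \mathbf{g}_k. \tag{19}$$

By subtracting the estimated PSD $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$ of one frame, e.g.
$p_2$, from the estimated PSD of another frame, e.g. $p_1$, we
obtain

$$\hat{\phi}^{s}_{\tilde{y}\tilde{y}}(p_1,k) = \hat{\phi}_{\tilde{y}\tilde{y}}(p_1,k) - \hat{\phi}_{\tilde{y}\tilde{y}}(p_2,k) = \hat{\phi}^{s}_{yy}(p_1,k) + e_{vv}(p_1,k), \tag{20}$$

with

$$\hat{\phi}^{s}_{yy}(p_1,k) = \hat{\phi}_{yy}(p_1,k) - \hat{\phi}_{yy}(p_2,k), \qquad e_{vv}(p_1,k) = \hat{\phi}_{vv}(p_1,k) - \hat{\phi}_{vv}(p_2,k).$$

Applying the same principle to $\hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p,k)$, we have:

$$\hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(p_1,k) = \hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p_1,k) - \hat{\boldsymbol{\phi}}_{\tilde{z}\tilde{y}}(p_2,k) = \hat{\boldsymbol{\phi}}^{s}_{zy}(p_1,k) + \mathbf{e}_{wv}(p_1,k), \tag{21}$$

with

$$\hat{\boldsymbol{\phi}}^{s}_{zy}(p_1,k) = \hat{\boldsymbol{\phi}}_{zy}(p_1,k) - \hat{\boldsymbol{\phi}}_{zy}(p_2,k), \qquad \mathbf{e}_{wv}(p_1,k) = \hat{\boldsymbol{\phi}}_{wv}(p_1,k) - \hat{\boldsymbol{\phi}}_{wv}(p_2,k).$$

Applying (19) to frames $p_1$ and $p_2$ and subtracting the
resulting equations, we obtain:

$$\hat{\phi}^{s}_{\tilde{y}\tilde{y}}(p_1,k) = \hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(p_1,k)^{\top} \mathbf{g}_k + e(p_1,k), \tag{22}$$

where

$$e(p_1,k) = e_{vv}(p_1,k) - \mathbf{e}_{wv}(p_1,k)^{\top} \mathbf{g}_k. \tag{23}$$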
Because $v(n)$ is stationary, $e_{vv}(p_1,k)$ is small. Conversely,
the fluctuations of speech signals are much larger than the
fluctuations of the noise signal because speech signals
are both non-stationary and sparse, i.e., the speech power
spectrum can vary significantly over frames. Hence, by properly
choosing the frame indexes $p_1$ and $p_2$, for instance in such a
way that the speech power $\hat{\phi}_{yy}(p_1,k)$ is high and the speech
power $\hat{\phi}_{yy}(p_2,k)$ is low, we have $\hat{\phi}^{s}_{yy}(p_1,k) \gg e_{vv}(p_1,k)$,
or equivalently $\hat{\phi}^{s}_{\tilde{y}\tilde{y}}(p_1,k) \gg e_{vv}(p_1,k)$. The same reasoning
applies to $\mathbf{e}_{wv}(p_1,k)$, except that the $u$-$v$ cross-terms of
$\mathbf{e}_{wv}(p_1,k)$ are small compared to $\hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(p_1,k)$ either if $u$ and
$v$ are uncorrelated, or if $u$ and $v$ are jointly WSS, which are
our (quite reasonable) working assumptions.


The choice of the frame index requires classifying the
frames into two sets, $\mathcal{P}_1$ and $\mathcal{P}_2$, which have high speech
power and very low speech power, respectively. This is done in
Subsection IV-D using the minimum and maximum statistics
of the noise spectrum. Before that, we finalize the estimation of
the DP-RTF in the noisy case, based on (22).


C. DP-RTF Estimation 


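Let $P_1 = |\mathcal{P}_1|$ denote the cardinality of $\mathcal{P}_1$. The PSD
subtractions (20) and (21) are applied to all the frames $p_1 \in \mathcal{P}_1$
using their corresponding frames $p_2 \in \mathcal{P}_2$, denoted as $p_2(p_1)$.
In practice, $p_2(p_1)$ is the frame in $\mathcal{P}_2$ that is nearest to $p_1$,
since the closer the two frames, the smaller the difference of
their noise PSD and the difference of their transfer function.
The resulting PSDs and cross-PSD vectors are gathered into a
$P_1 \times 1$ vector and a $P_1 \times (2Q_k - 1)$ matrix, respectively, as:

$$\hat{\boldsymbol{\phi}}^{s}_{\tilde{y}\tilde{y}}(k) = [\hat{\phi}^{s}_{\tilde{y}\tilde{y}}(1,k), \dots, \hat{\phi}^{s}_{\tilde{y}\tilde{y}}(p_1,k), \dots, \hat{\phi}^{s}_{\tilde{y}\tilde{y}}(P_1,k)]^{\top},$$
$$\hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k) = [\hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(1,k), \dots, \hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(p_1,k), \dots, \hat{\boldsymbol{\phi}}^{s}_{\tilde{z}\tilde{y}}(P_1,k)]^{\top}.$$

Let us denote by $\mathbf{e}(k) = [e(1,k), \dots, e(p_1,k), \dots, e(P_1,k)]^{\top}$ the
$P_1 \times 1$ vector that concatenates the residual noise for the
$P_1$ frames. Then, from (22) we obtain the following linear
equation, which is the "noisy version" of (15):

$$\hat{\boldsymbol{\phi}}^{s}_{\tilde{y}\tilde{y}}(k) = \hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)\, \mathbf{g}_k + \mathbf{e}(k). \tag{24}$$

Assuming that the sequence of residual noise entries in $\mathbf{e}(k)$
is i.i.d.^2 and also assuming $P_1 \ge (2Q_k - 1)$, the least-squares
solution to (24) is given by:

$$\hat{\mathbf{g}}_k = \big(\hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)^{H}\, \hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)\big)^{-1} \hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)^{H}\, \hat{\boldsymbol{\phi}}^{s}_{\tilde{y}\tilde{y}}(k), \tag{25}$$

where $^{H}$ denotes matrix conjugate transpose. Finally, the
estimation of the DP-RTF $d_k$ defined in (8) is provided by
the first element of $\hat{\mathbf{g}}_k$, denoted as $\hat{g}_{0,k}$.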
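
The pairing and subtraction steps that feed (24) can be sketched as below. This is an illustrative outline only: the PSD estimates are assumed to be already available per frame (here as dictionaries keyed by frame index), the two frame classes are assumed given, and the function names are hypothetical. The stacked right-hand side and rows would then be passed to any least-squares solver to obtain the estimate in (25).

```python
def nearest_noise_frame(p1, noise_frames):
    """Return the frame of the low-speech class that is closest in time to p1."""
    return min(sorted(noise_frames), key=lambda p2: abs(p2 - p1))


def inter_frame_subtraction(phi_yy, phi_zy, speech_frames, noise_frames):
    """Build the subtracted PSDs for every high-speech frame.

    phi_yy: dict frame -> noisy auto-PSD estimate (scalar).
    phi_zy: dict frame -> noisy cross-PSD vector (list of complex).
    Returns (rhs, rows): the stacked subtracted auto-PSDs and cross-PSD vectors.
    """
    rhs, rows = [], []
    for p1 in sorted(speech_frames):
        p2 = nearest_noise_frame(p1, noise_frames)
        rhs.append(phi_yy[p1] - phi_yy[p2])
        rows.append([u - w for u, w in zip(phi_zy[p1], phi_zy[p2])])
    return rhs, rows


# Toy data: a constant noise floor (0.3 and 0.1) sits on top of the speech terms.
phi_yy = {0: 0.3, 1: 4.3, 2: 5.3, 3: 0.3}
phi_zy = {0: [0.1 + 0j], 1: [2.1 + 1j], 2: [3.1 - 1j], 3: [0.1 + 0j]}
rhs, rows = inter_frame_subtraction(phi_yy, phi_zy, [1, 2], [0, 3])
```

In this toy case the stationary noise floor cancels in the subtraction, which is the reason no explicit noise PSD estimate is needed.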


Note that if two frames in $\mathcal{P}_1$ are close to each other, their
corresponding elements in vector $\hat{\boldsymbol{\phi}}^{s}_{\tilde{y}\tilde{y}}(k)$ (or corresponding
rows in matrix $\hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)$) will be correlated. This correlation
yields some redundancy of the linear equations. However, in
practice, we keep this redundancy to make full use of data and
give a more robust solution to (24).


^2 This assumption is made to simplify the analysis. In practice, $e(p_1,k)$ may
be a correlated sequence because of the possible correlation of $\hat{\phi}_{vv}(p,k)$ (or
$\hat{\boldsymbol{\phi}}_{wv}(p,k)$) across frames. Taking this correlation into account would lead to
a weighted least-squares solution to (24), involving a weight matrix in (25).
This weight matrix is not easy to estimate, and in practice, (25) delivers a
good estimate $\hat{g}_{0,k}$, as assessed in our experiments.


Still assuming that $e(p_1,k)$ is i.i.d. and denoting its variance
by $\sigma^2$, the covariance matrix of $\hat{\mathbf{g}}_k$ is given by [28]:

$$\mathrm{cov}\{\hat{\mathbf{g}}_k\} = \sigma^2 \big(\hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)^{H}\, \hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)\big)^{-1}. \tag{26}$$

The statistical analysis of the auto- and cross-PSD estimates
shows that $\sigma^2$ is inversely proportional to the number of
smoothing frames $D$ [28]. Hence, using a large $D$ leads to
a small error variance $\sigma^2$. However, increasing $D$ decreases
the fluctuation of the estimated speech PSD among frames
and thus makes the elements in the matrix $\hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)^{H}\, \hat{\boldsymbol{\Phi}}^{s}_{\tilde{z}\tilde{y}}(k)$
smaller, which results in a larger variance of $\hat{\mathbf{g}}_k$. Therefore,
an appropriate value of $D$ should be chosen to achieve a good
tradeoff between smoothing the noise spectrum and preserving
the fluctuation of the speech spectrum.

Finally, to improve the robustness of the DP-RTF
estimation, we also calculate (25) after exchanging the roles of the
two channels in the whole process. This delivers an estimate
$\hat{g}'_{0,k}$ of the inverse of (8), i.e. an estimate of the inverse
DP-RTF. Both $\hat{g}_{0,k}$ and $\hat{g}'^{-1}_{0,k}$ are estimates of $d_k$. The final
DP-RTF estimate is given by averaging these two estimates as:

$$\hat{c}_k = \frac{1}{2}\big(\hat{g}_{0,k} + \hat{g}'^{-1}_{0,k}\big). \tag{27}$$


D. Frame Classification 


We adopt the minimum-maximum statistics for frame
classification, which was first introduced in 03, and is applied to
a different feature in this paper. Frame classification is based
on the estimation of the $\tilde{y}$ PSD, i.e., $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$. The frame $p_1$
is selected such that $\hat{\phi}^{s}_{\tilde{y}\tilde{y}}(p_1,k)$ in (22) is large compared to
$e(p_1,k)$, and thus (22) matches well the noise-free case.

As shown in (17), the PSD estimation $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$ is
composed of both speech and noise powers. A minimum statistics
formulation was proposed in (23, where the minimum value
of the smoothed periodograms with respect to the index $p$,
multiplied by a bias correction factor, is used as the estimation
of the noise PSD. Here we introduce an equivalent sequence
length for analyzing the minimum and maximum statistics of
noise spectra, and propose to use two classification thresholds
(for two classes $\mathcal{P}_1$ and $\mathcal{P}_2$) defined from the ratios between
the maximum and minimum statistics. In short, we classify the
frames by using the minimum controlled maximum border.

Formally, the noise power in $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$ is

$$\xi_{p,k} = \hat{\phi}_{vv}(p,k) = \frac{1}{D} \sum_{d=0}^{D-1} |v_{p-d,k}|^2. \tag{28}$$

For a stationary Gaussian signal, the probability density
function (PDF) of the periodogram $|v_{p,k}|^2$ obeys the exponential
distribution [29]

$$f(|v_{p,k}|^2; \lambda) = \frac{1}{\lambda}\, e^{-|v_{p,k}|^2/\lambda}, \tag{29}$$

where $\lambda = E\{|v_{p,k}|^2\}$ is the noise PSD. Assume that the
sequence of $|v_{p,k}|^2$ values at different frames are i.i.d. random
variables. The averaged periodogram $\xi_{p,k}$ obeys the Erlang
distribution [30] with scale parameter $\mu = \lambda/D$ and shape
parameter $D$:

$$f(\xi_{p,k}; D, \mu) = \frac{\xi_{p,k}^{D-1}\, e^{-\xi_{p,k}/\mu}}{\mu^{D}\, (D-1)!}. \tag{30}$$

We are interested in characterizing and estimating the ratio
between the maximum and minimum statistics of the sequence
$\xi_{p,k}$. Since the maximum and minimum statistics are both
linearly proportional to $\mu$ 1 ^ . we assume, without loss of
generality, that $\mu = 1$. Consequently the mean value of $\xi_{p,k}$
is equal to $D$.


As mentioned in Section III-A, the frame index of the
estimated PSDs $\hat{\phi}_{\tilde{y}\tilde{y}}(p,k)$ and $\xi_{p,k}$ is confined to the range
$p_f$ to $P$. Let $R$ denote the increment of the frame index $p$
of the estimated PSDs. If $R$ is equal to or larger than $D$,
for two adjacent estimated PSDs $\xi_{p,k}$ and $\xi_{p+R,k}$, there is no
frame overlap. The sequence $\xi_{p,k}$, $p = p_f : R : P$ is then an
independent random sequence. The length of this sequence is
$\tilde{P} = \lceil (P - p_f + 1)/R \rceil$. The PDFs of the minimum and maximum of
these $\tilde{P}$ independent variables are ED:

$$f_{\min}(\xi) = \tilde{P}\, \big(1 - F(\xi)\big)^{\tilde{P}-1} f(\xi), \qquad f_{\max}(\xi) = \tilde{P}\, F(\xi)^{\tilde{P}-1} f(\xi), \tag{31}$$

where $F(\cdot)$ denotes the cumulative distribution function (CDF)
associated with the PDF (30). Conversely, if $R < D$, $\xi_{p,k}$ is a
correlated sequence, and the correlation coefficient is linearly
proportional to the frame overlap. For this case, (31) is no
longer valid. Based on a large number of simulations
using white Gaussian noise (WGN),^3 it was found that the
following approximate equivalent sequence length

$$\tilde{P}' = \frac{\tilde{P} R}{D} \left(1 + \log\frac{D}{R}\right) \tag{32}$$

can replace $\tilde{P}$ in order to make (31) valid for the correlated
sequence. We observe that the ratio between the number $D$ of
frames used for spectrum averaging and the frame increment
$R$ of PSD estimates is replaced with its logarithm. Note that
this is an empirical result, for which a theoretical foundation
remains to be investigated.
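
The equivalent sequence length can be checked numerically against the sequence lengths quoted for Fig. 1. The sketch below reads (32) as $(\tilde{P} R / D)(1 + \log(D/R))$ with a natural logarithm, which is an interpretation that reproduces the quoted values (69 and 24 frames for an equivalent length of about 20, 344 and 118 frames for about 100, with $D = 12$). The function names are hypothetical, and returning the plain length when $R \ge D$ reflects the independent case described in the text.

```python
import math


def independent_length(P, p_f, R):
    """Number of PSD estimates picked from frames p_f, p_f + R, ..., up to P."""
    return math.ceil((P - p_f + 1) / R)


def equivalent_length(P_tilde, R, D):
    """Equivalent length of a correlated PSD sequence (natural log assumed).

    For R >= D the PSD estimates do not overlap, so the sequence is treated
    as independent and its length is returned unchanged.
    """
    if R >= D:
        return float(P_tilde)
    return P_tilde * R / D * (1.0 + math.log(D / R))


# Values quoted in the discussion of Fig. 1 (D = 12).
lengths = [equivalent_length(69, 1, 12), equivalent_length(24, 6, 12),
           equivalent_length(344, 1, 12), equivalent_length(118, 6, 12)]
```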


Then, the expectation of the minimum can be approximately
computed as

$$\bar{\xi}_{\min} \approx \frac{\sum_{\xi_i} \xi_i\, f_{\min}(\xi_i)}{\sum_{\xi_i} f_{\min}(\xi_i)}, \tag{33}$$

where $\xi_i \in \{0, 0.1D, 0.2D, \dots, 3D\}$ is a grid used to
approximate the integral operation, which covers well the support of
the Erlang distribution with shape $D$ and scale 1. Similarly,
the CDF of the maximum can be estimated as

$$F_{\max}(\xi) \approx \frac{\sum_{\xi_i \le \xi} f_{\max}(\xi_i)}{\sum_{\xi_i} f_{\max}(\xi_i)}. \tag{34}$$

^3 The simulations are done with the following procedure: applying STFT to a
number of WGN signals with identical long duration. For each time-frequency
bin, estimate the PSD by averaging the periodograms of the past $D$ frames.
Without loss of generality, the scale parameter $\mu$ of the PSD estimation can
be set to 1 by adjusting the noise PSD $\lambda$ to $D$. A sequence of correlated
PSD estimates is generated by picking PSD estimates from the complete
sequence, with frame increment $R$ (with $R < D$). The length of the correlated
sequence is $\tilde{P}$. The minimum/maximum values of each correlated sequence
are collected at each frequency for all the WGN signals. The PDF and CDF
of the minimum/maximum statistics are simulated by the histograms of these
minimum/maximum values. Fig. 1 shows some examples of this empirical
CDF.

[Figure 1; panels: Minimum, Maximum]

Fig. 1: Cumulative distribution function (CDF) of the minimum and
maximum statistics of $\xi_{p,k}$ for $D = 12$.

Finally, we define two classification thresholds that are two
specific values of the maximum and minimum ratios, namely

$$r_1 = \frac{\xi_{F_{\max}(\xi)=0.95}}{\bar{\xi}_{\min}}, \qquad r_2 = \frac{\xi_{F_{\max}(\xi)=0.5}}{\bar{\xi}_{\min}}, \tag{35}$$

where $\xi_{F_{\max}(\xi)=0.95}$ and $\xi_{F_{\max}(\xi)=0.5}$ are the values of $\xi$ for
which the CDF of the maximum is equal to 0.95 and 0.5,
respectively. Classes $\mathcal{P}_1$ and $\mathcal{P}_2$ are then obtained with

$$\mathcal{P}_1 = \{p \mid \xi_{p,k} > r_1 \cdot \min_p\{\xi_{p,k}\}\}, \tag{36}$$

$$\mathcal{P}_2 = \{p \mid \xi_{p,k} < r_2 \cdot \min_p\{\xi_{p,k}\}\}. \tag{37}$$

These two thresholds are set to ensure that the frames in $\mathcal{P}_1$
contain large speech power and the frames in $\mathcal{P}_2$ contain
negligible speech power. The speech power for the other
frames is probabilistically uncertain, making them unsuitable
for either $\mathcal{P}_1$ or $\mathcal{P}_2$. Using two different thresholds clearly
separates the speech region and the noise-only region. In other words,
there is a low probability of having a frame classified into $\mathcal{P}_1$
in the proximity of $\mathcal{P}_2$ frames, and vice versa. Therefore, in
general, the PSD of a frame in $\mathcal{P}_1$ is estimated using $D$ frames
that are not included in the noise-only region, and vice versa.
Note that if there are no frames with speech content, e.g.,
during long speech pauses, class $\mathcal{P}_1$ will be empty with a
probability of 0.95 due to threshold $r_1$.
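
The threshold computation can be sketched numerically as follows. This is a rough illustration under several assumptions: the Erlang distribution has shape $D$ (at least 2) and scale 1, the grid is $\{0, 0.1D, \dots, 3D\}$, the CDF of the maximum is approximated by a normalized cumulative sum over that grid, and each quantile is taken as the first grid point at which this sum reaches the requested level. The function names are hypothetical, and the classification is applied directly to a list of PSD values for one frequency bin.

```python
import math


def erlang_pdf(xi, D):
    """Erlang PDF with shape D >= 2 and scale 1."""
    if xi <= 0:
        return 0.0
    return math.exp((D - 1) * math.log(xi) - xi - math.lgamma(D))


def erlang_cdf(xi, D):
    """Erlang CDF with shape D and scale 1, clamped to [0, 1] against rounding."""
    if xi <= 0:
        return 0.0
    tail = math.exp(-xi) * sum(xi ** n / math.factorial(n) for n in range(D))
    return min(1.0, max(0.0, 1.0 - tail))


def thresholds(D, seq_len):
    """Return (r1, r2) from the max/min statistics of seq_len PSD estimates."""
    grid = [0.1 * D * i for i in range(31)]              # 0, 0.1D, ..., 3D
    f = [erlang_pdf(g, D) for g in grid]
    F = [erlang_cdf(g, D) for g in grid]
    f_min = [seq_len * (1.0 - Fi) ** (seq_len - 1) * fi for fi, Fi in zip(f, F)]
    f_max = [seq_len * Fi ** (seq_len - 1) * fi for fi, Fi in zip(f, F)]
    mean_min = sum(g * w for g, w in zip(grid, f_min)) / sum(f_min)
    total = sum(f_max)

    def quantile(level):
        acc = 0.0
        for g, w in zip(grid, f_max):
            acc += w
            if acc / total >= level:
                return g
        return grid[-1]

    return quantile(0.95) / mean_min, quantile(0.5) / mean_min


def classify_frames(xi, r1, r2):
    """Split frame indices into high-speech and low-speech classes."""
    floor = min(xi)
    speech = [p for p, v in enumerate(xi) if v > r1 * floor]
    noise = [p for p, v in enumerate(xi) if v < r2 * floor]
    return speech, noise


r1, r2 = thresholds(D=12, seq_len=20)
speech, noise = classify_frames([1.0, 1.2, 50.0, 1.1, 40.0], r1, r2)
```

Frames whose power falls between the two thresholds end up in neither class, which mirrors the idea that their speech content is uncertain.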


As an illustration of (32), Fig. 1 shows the CDF for $D = 12$.
The empirical curves are simulated using WGN, and the
analytical curves are computed using the equivalent sequence
length in (32). The minimum CDF and maximum CDF of
two groups of simulations are shown, for which the equivalent
sequence lengths $\tilde{P}'$ are fixed at 20 and 100, respectively. For
each equivalent sequence length $\tilde{P}'$, two empirical curves with
frame increment $R = 1$ and $R = 6$ are simulated using WGN,
whose corresponding original sequence lengths are $\tilde{P} = 69$
and $\tilde{P} = 24$ for $\tilde{P}' = 20$, and $\tilde{P} = 344$ and $\tilde{P} = 118$
for $\tilde{P}' = 100$, respectively. This shows that the equivalent
sequence length in (32) is accurate for the minimum and
maximum statistics.


V. Sound Source Localization Method


The amplitude and the phase of DP-RTF represent the 
amplitude ratio and phase difference between two source-to- 
microphone direct-path ATFs. In other words, in case of two 
microphones, the DP-RTF is equivalent to the interaural cues, 
ILD and IPD, associated to the direct path. More generally, we 
consider here J microphones. This is a slight generalization 
that will directly exploit the previous developments, since we 
consider these J microphones pair-wise. As in (321, (3^ . we 
consider the normalized version of the DP-RTF estimate ( [Z7] ) 
between microphones i and j : 


^k,ij 


^k,ij 

\/l 3“ \^k,ij P 


(38) 


Compared to the amplitude ratio, the normalized DP-RTF is
more robust. In particular, when the reference transfer function
$a_{0,k}$ is much smaller than $b_{0,k}$, the amplitude ratio estimation
is sensitive to noise present in the reference channel. By
concatenating (38) across K frequencies and across (J−1)J/2
microphone pairs, we obtain a high-dimensional feature vector
$c \in \mathbb{C}^{(J-1)JK/2}$. Since speech signals have a sparse STFT
representation, we denote by $h$ an indicator vector of the same
dimension, whose elements are equal to 1 if the energy
at the corresponding frequency is significant, and equal to 0
if the energy is negligible. In practice, the indicator vector
entries at a given frequency k are set to 0 if the corresponding
matrix is underdetermined, i.e. Pi < (2Qk − 1) for that
frequency. This way, we do not use any DP-RTF calculated
from (25) for such a “missing frequency” (see below).
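The normalization in (38) and the indicator vector can be illustrated with a minimal sketch. It assumes the per-frequency DP-RTF estimates are already available as complex numbers; the function names, and the use of `None` to mark a frequency without an estimate, are illustrative choices and not part of the paper.

```python
import math

def normalize_dp_rtf(d):
    """Map a complex DP-RTF estimate d to d / sqrt(1 + |d|^2), as in (38)."""
    return d / math.sqrt(1.0 + abs(d) ** 2)

def build_feature(estimates):
    """Build the feature vector c and the 0/1 indicator vector h.

    `estimates` holds one complex DP-RTF estimate per frequency, or None
    for a "missing frequency" (hypothetical convention for this sketch).
    """
    c, h = [], []
    for d in estimates:
        if d is None:
            c.append(0j)   # placeholder value, masked out by h
            h.append(0)
        else:
            c.append(normalize_dp_rtf(d))
            h.append(1)
    return c, h

# Toy input: a balanced ratio, a missing frequency, a purely imaginary
# ratio, and a very large ratio (tiny reference-channel response).
c, h = build_feature([1 + 0j, None, 3j, 1e6 + 0j])
```

Whatever the size of the raw ratio, the normalized entries have magnitude at most 1, which is one way to see why a very small reference transfer function does not blow up the feature.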

The proposed DP-RTF estimation method is suitable for the 
most general case of microphone setup where the microphones 
are not necessarily placed in free-field. In other words it can be 
applied to any microphone pair in any microphone array setup. 
For instance, in the present paper, the microphones are placed 
in the ears of a dummy head or on the head of a robot. In 
these cases, there is no clear (analytical) relationship between 
the HRIR/HRTF/DP-RTF and the DOA of the emitting source, 
even after removal of the noise and reverberations. In order 
to perform SSL based on the feature vector c, we adopt 
here a supervised framework: A training set $\mathcal{D}_{c,q}$ of $I$ pairs
$\{c_i, q_i\}_{i=1}^{I}$ is available, where $c_i$ is a DP-RTF feature vector
generated with an anechoic head-related impulse response
(HRIR), and $q_i$ is the corresponding source-direction vector.
Then, for an observed (test) feature vector c that is extracted
from the microphone signals, the corresponding direction
is estimated using either (i) nearest-neighbor search in the
training set (considered as a look-up table) or (ii) a regression
whose parameters have been tuned from the training set. Note
that the training set and the observed test features should
be recorded using the same microphone set-up. This way,
the HRIR of the training set (corresponding to an anechoic
condition) corresponds to the direct path of the BRIR of the
test condition (recorded in a reverberant condition).

Nearest-neighbor search corresponds to solving the following
minimization problem (⊙ denotes the Hadamard product,
i.e. entry-wise product):

$$\hat{q} = q_{i^\star}, \qquad i^\star = \operatorname*{argmin}_{i \in \{1,\dots,I\}} \| h \odot (c - c_i) \|. \qquad (39)$$

As mentioned above, the indicator vector h selects
the relevant DP-RTF vector components, i.e. the ones
corresponding to frequencies with an (over)determined solution to
(24). Because the test feature vectors are sparse,
not every regression technique can be used. Indeed, one needs
a regression method that allows training with full-spectrum
signals and testing with sparse-spectrum signals. Moreover,
the input DP-RTF vectors are high dimensional, and not every
regression method can handle high-dimensional input data. For
these reasons we adopted the probabilistic piece-wise linear
regression technique of [6].
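The masked nearest-neighbor search of (39) can be sketched as follows. This is an illustration under simplifying assumptions: the training set is a plain list of (feature, direction) pairs, the norm is Euclidean, and all names and numbers are hypothetical toy values.

```python
import math

def masked_distance(h, c, c_i):
    """Euclidean norm of h ⊙ (c - c_i); entries with h = 0 are ignored."""
    return math.sqrt(sum((hk * abs(ck - cik)) ** 2
                         for hk, ck, cik in zip(h, c, c_i)))

def nearest_neighbor(h, c, training_set):
    """Return the direction q_i whose feature c_i is closest to c.

    `training_set` is a list of (c_i, q_i) pairs used as a look-up table.
    """
    best = min(training_set, key=lambda pair: masked_distance(h, c, pair[0]))
    return best[1]

# Toy look-up table: three features of dimension 3 with azimuths in degrees.
training = [
    ([0.5 + 0.1j, 0.2j, 0.3 + 0j], -30.0),
    ([0.1 + 0.5j, 0.4 + 0j, 0.3j], 0.0),
    ([0.2 - 0.4j, 0.1j, 0.6 + 0j], 30.0),
]
test_c = [0.1 + 0.5j, 9.9 + 9.9j, 0.3j]  # second entry is an unreliable value
test_h = [1, 0, 1]                        # ... so it is masked out
estimate = nearest_neighbor(test_h, test_c, training)
```

Because the mask zeroes the unreliable entry, the test vector matches the second training feature exactly on the retained components.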

VI. Experiments with Simulated Data 

We report the results of experiments carried out to
evaluate the performance of the proposed method. We
simulated various experimental conditions in terms of reverberation
and additive noise.


A. The Dataset 

The BRIRs are generated with the ROOMSIM simulator
[34] and with the head-related transfer function (HRTF) of a
KEMAR dummy head [35]. The responses are simulated in
a rectangular room of dimension 8 m × 5 m × 3 m. The
KEMAR dummy head is located at (4, 1, 1.5) m. The sound
sources are placed in front of the dummy head with azimuths
varying from −90° to 90°, spaced by 5°, an elevation of 0°,
and distances of 1 m, 2 m, and 3 m; see Fig. 2.

The absorption coefficients of the six walls are equal,
and adjusted to control T60 at 0.22 s, 0.5 s and 0.79 s,
respectively. Two other quantities, i.e. the ITDG and the direct-
to-reverberation ratio (DRR), are also important to measure
the intensity of the reverberation. In general, the larger the
source-to-sensors distance is, the smaller the ITDG and DRR
are. For example, when T60 is 0.5 s, the DRRs for 1, 2, 3 m
are about 1.6, −4.5 and −8.1 dB, respectively. Speech signals
from the TIMIT dataset (^ are used as the speech source 
signals, which are convolved with the simulated BRIRs to 
generate the sensor signals. Each BRIR is convolved with 10 
different speech signals from TIMIT to achieve reliable SSL 
results. Note that the elevation of the speech sources is always
equal to 0° in the BRIR dataset; hence, in these simulated-data
experiments the source direction corresponds to the azimuth
only. The feature vectors in the training set are
generated with the anechoic HRIRs of the KEMAR dummy
head from the azimuth range [−90°, 90°], spaced by 5°, i.e.
I = 37. In this section, the nearest-neighbor search is adopted
for localization.
Fig. 2: Configurations of room, dummy head, speech sources and
noise source for the BRIR dataset.


Two types of noise signals are generated: (i) a “directional
noise” is obtained by convolving a single-channel WGN signal
with a BRIR corresponding to a position beside the wall with
an azimuth of 120°, an elevation of 30° and a distance of 2.2 m, see
Fig. 2; (ii) an “uncorrelated noise” consists of an independent
WGN signal on each channel. Noise signals are added to the
speech sensor signals with various signal-to-noise ratios.
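Mixing a noise signal at a prescribed SNR can be sketched as below. The paper does not detail how the SNR is measured for the two channels, so this example assumes a single channel and powers averaged over the whole signal; the sinusoid merely stands in for a speech sensor signal and all names are illustrative.

```python
import math
import random

def power(x):
    """Average power of a real-valued signal."""
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db,
    then add it to `speech`. Returns the mixture and the scaled noise."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * v for v in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled

rng = random.Random(0)
speech = [math.sin(0.05 * n) for n in range(4000)]   # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # white Gaussian noise
noisy, scaled_noise = add_noise_at_snr(speech, noise, -5.0)
achieved = 10 * math.log10(power(speech) / power(scaled_noise))
```

For the "uncorrelated" case one would draw an independent noise sequence per channel, whereas the "directional" case would convolve a single sequence with the BRIR of the noise position before scaling.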


B. Setting the Parameters 

The sampling rate is 16 kHz. Only the frequency band
from 0 to 4 kHz is considered for speech source localization.
The setting of all three parameters N, Qk and D is crucial
for a good estimation of the DP-RTF. Intuitively, Qk should
correspond to the value of T60 at the k-th frequency bin. For
simplicity, we set Qk to be the same for all frequencies and
denote it as Q. In the remainder of this subsection, we present
preliminary SSL experiments that were done in order to tune
N, Q and D to an “optimal tradeoff” setting that would ensure
good SSL performance for a large range of acoustic conditions.
Since considering all possible joint settings of these three
parameters is a hard task, we fix the other two when exploring
the setting of one of them.

In all the following, the localization error is taken as the 
performance metric. It is computed by averaging the absolute 
errors between the localized directions and their corresponding 
ground truth (in degrees) over the complete test dataset. 
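As a small illustration of this metric, the following sketch computes the mean absolute error between estimated and ground-truth directions. The four azimuth values are invented for the example and are not taken from the experiments.

```python
def localization_error(estimates_deg, truths_deg):
    """Mean absolute error (in degrees) over a set of localized sources."""
    if len(estimates_deg) != len(truths_deg):
        raise ValueError("estimates and ground truth must have equal length")
    total = sum(abs(e - t) for e, t in zip(estimates_deg, truths_deg))
    return total / len(truths_deg)

# Hypothetical azimuth estimates against their ground truth.
error = localization_error([-90, 0, 35, 90], [-90, 5, 30, 85])
```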

Let us first consider the setting of Q. Here we fix N =
256 with 50% overlap, and D = 12. Table I shows the
localization errors for Q values corresponding to CTF lengths
in [0.1T60, ..., 0.4T60] with T60 = 0.5 s. When the SNR is
high (first four lines; SNR = 10 dB), the influence of noise
is small, and the DRR plays a dominant role. Comparing the
localization errors for source-to-sensors distances of 1 m
and 2 m, we see that small localization errors are obtained with
rather small Q values for 1 m, and with the larger Q values
for 2 m. This result indicates that, for a given T60, Q should
be increased when the DRR decreases: the CTF should
cover most of the energy of the room impulse response. By
comparing the results for the uncorrelated noise at 10 dB and
−5 dB, with the source at 2 m (second and fifth lines), we observe
that the smallest localization error is achieved by a smaller Q
for the low-SNR case than for the high-SNR case. Note
that a larger Q corresponds to a greater model complexity,
which needs more reliable (less noisy) data to be estimated.
The intense uncorrelated noise degrades the data, hence a
small Q is preferred. In contrast, for the directional noise, a
large Q is also suitable for the low-SNR case (sixth line). The
reason is possibly that the directional noise signal has a
convolution structure similar to that of the speech signal, and the noise
residual e(k) also has a similar convolution structure. Hence
the data reliability is not degraded much. In conclusion, the
optimal Q varies with the T60, DRR, noise characteristics, and
noise intensity. In practice, it is difficult to obtain these features
automatically; hence we assume that T60 is known, and we set
Q to correspond to 0.25T60 as a tradeoff for different acoustic
conditions.


Let us now consider the setting of D. Here, we set Q 
to correspond to 0.25T60, and N = 256 with 50% overlap. 
The number of frames D is crucial for an efficient spectral 
subtraction (Section IV-B). A large D yields a small noise 
residual. However, the remaining speech power after spectral 
subtraction may also be small because of the small fluctuations 
of the speech PSD estimate between frames when D is large. 
Table II shows the localization errors for D ∈ [6, ..., 20]
under different conditions. Note that only the results for the 
low SNR case (—5 dB) are shown, for which the effect of 
noise suppression plays a more important role. It can be seen 
(first line) that a large D yields the smallest localization error, 
which means that removing noise power is more important 
than retaining speech power for this condition. The reason is 
that the DRR is large for source-to-sensors distance of 1 m, 
so that the direct-path speech power is relatively large. As D 
increases, the remaining direct-path speech power decreases 
only slightly, compared to the decrease of the noise residual. 
In contrast, a small D yields the smallest localization error 
for the directional noise at 2 m (fourth line), which means 
that retaining speech power is more important than removing 
noise power for this condition. The reasons are that (i) as 
described above, the data reliability is not degraded much by 
the directional noise in the sense of convolution, and (ii) the 
direct-path speech power is relatively small for a source-to- 
sensors distance of 2 m. The conditions of the second and third 
lines fall in between the first line and the fourth line, and these 
results do not strongly depend on D. It is difficult to choose a 
D value that is optimal for all the acoustic conditions. In the 
following, we set D = 12 frames (100 ms) as a fair tradeoff. 


As for the setting of N, recall that the reflections
present in $a(n)|_{n=0}^{N}$ lead to a biased definition of the DP-RTF. In
order to minimize the reflections contained in $a(n)|_{n=0}^{N}$, the
STFT window length N should be as small as possible, while
still capturing the direct-path response. However, in practice,
a small N requires a large Q for the CTF to cover the
room impulse response well, which increases the complexity of the
DP-RTF estimate. We tested the localization performance for
three STFT window sizes: 8 ms (N = 128 samples), 16 ms
(N = 256 samples), and 32 ms (N = 512 samples), with 50%
overlap. Again, Q corresponds to 0.25T60. For example, with
T60 = 0.79 s and with N = 128, 256, 512 respectively, Q is
TABLE I: Localization errors (degrees) for different values of Q in different conditions. T60 = 0.5 s. “Distance” stands for source-to-sensors
distance. The value marked with an asterisk (*) is the minimum localization error for each condition.

Noise type     SNR      Distance   Q/T60 (T60 = 0.5 s)
                                   0.1      0.15     0.2      0.25     0.3      0.35     0.4
Uncorrelated   10 dB    1 m        0.122    0.081    0.077*   0.081    0.099    0.108    0.113
Uncorrelated   10 dB    2 m        1.338    0.847    0.716    0.649    0.629    0.608    0.568*
Directional    10 dB    1 m        0.135    0.113*   0.122    0.131    0.149    0.158    0.162
Directional    10 dB    2 m        1.437    0.869    0.829    0.680    0.644    0.626    0.617*
Uncorrelated   −5 dB    2 m        7.824    6.833    6.703    6.680*   6.802    6.964    7.149
Directional    −5 dB    2 m        13.36    12.25    11.90    11.23    10.96    10.52    10.38*


TABLE II: Localization errors (degrees) for different values of D in different conditions. T60 = 0.5 s. “Distance” stands for source-to-sensors
distance. The value marked with an asterisk (*) is the minimum localization error for each condition.

Noise type     SNR      Distance   D (frames)
                                   6        8        10       12       14       16       18       20
Uncorrelated   −5 dB    1 m        2.59     2.15     2.09     1.99     1.86     1.81     1.64     1.59*
Uncorrelated   −5 dB    2 m        7.37     6.03*    6.17     6.68     6.08     6.40     6.90     6.50
Directional    −5 dB    1 m        3.83     3.42     3.51     3.23     3.70     3A1      2.96*    3.45
Directional    −5 dB    2 m        9.80*    10.28    10.32    11.23    11.60    13.18    13.62    15.35


TABLE III: Localization errors (degrees) for three values of N.
“Distance” is the sensors-to-source distance. The value marked with an
asterisk (*) is the minimum localization error. In this experiment, the noise signal is
generated by summing the directional noise and uncorrelated noise
with identical powers.

SNR      Distance   T60      STFT window length N
                             128 (8 ms)   256 (16 ms)   512 (32 ms)
10 dB    1 m        0.22 s   0.01*        0.01*         0.02
10 dB    3 m        0.22 s   0.58*        1.19          1.89
10 dB    3 m        0.79 s   9.60         9.22*         9.55
−5 dB    1 m        0.22 s   1.89         1.62          1.49*
−5 dB    3 m        0.22 s   8.07         6.30*         7.04
−5 dB    3 m        0.79 s   22.66        20.81         17.75*


equal to 50, 25, 13 frames respectively. D is set to 100 ms:
for N = 128, 256, 512, D is 24, 12, 6 frames, respectively.
Table III shows the localization errors under various acoustic
conditions. We first discuss the case of high SNR (first three
lines). When the source-to-sensors distance is small (1 m;
first line), the ITDG is relatively large and we observe that
N = 128 and N = 256 (8 ms and 16 ms windows)
achieve comparable performance. This indicates that, if the
ITDG is relatively large, there are not many more reflections
in $a(n)|_{n=0}^{N}$ for a 16-ms window than for an 8-ms
window. The next results (second line) show that, when T60
is small (0.22 s), the localization performance decreases much
more for a 16-ms and a 32-ms window than for an 8-ms
window, as the source-to-sensors distance increases from 1 m to
3 m. A lower ITDG yields a larger DP-RTF estimation error
due to the presence of more reflections in $a(n)|_{n=0}^{N}$. When T60
increases to 0.79 s, Q becomes larger, especially for N = 128.
It can be seen (third line) that here N = 256 yields a better
performance than the other values. This is because the lack of data
leads to a large DP-RTF estimation error for N = 128, and the
reflections in $a(n)|_{n=0}^{N}$ bring a large DP-RTF estimation error
for N = 512. When the SNR is low (−5 dB; last three lines),
less reliable data are available due to noise contamination. In
that case, a large N achieves the best performance. Finally,
we set N = 256 (16-ms STFT window) as a good overall
tradeoff between all tested conditions.
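The conversion from 0.25T60 to a CTF length in frames can be checked numerically. The sketch below assumes that the hop size is half the window (50% overlap) and that the result is rounded up; the rounding rule is not stated in the text and is chosen here only because it reproduces the quoted values of 50, 25 and 13 frames for T60 = 0.79 s.

```python
import math

def ctf_length_frames(t60_s, window_len, fs=16000, fraction=0.25, overlap=0.5):
    """Number of STFT frames needed to span `fraction * T60`.

    The ceiling is an assumption of this sketch, not a rule from the paper.
    """
    hop = int(window_len * (1 - overlap))   # frame step in samples
    return math.ceil(fraction * t60_s * fs / hop)

# T60 = 0.79 s with the three tested window lengths.
q_values = [ctf_length_frames(0.79, n) for n in (128, 256, 512)]
print(q_values)  # [50, 25, 13]
```

Halving the window length roughly doubles Q, and hence the number of unknowns per frequency, which is the complexity tradeoff discussed above.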


C. DP-RTF Estimation 

We provide several representative examples showing the
influence of both reverberation and noise on the DP-RTF
estimates. The phase and normalized amplitude of the estimated
DP-RTF for three acoustic conditions are shown in Fig. 3. The
SNR is set to 30 dB in the first two examples, hence the noise
is negligible. The difference between the estimated and the
ground-truth phase is referred to as the phase estimation error.
It can be seen that, for most frequency bins, the mean value
(over ten trials) of the phase estimation error is very small (but
nonzero, which indicates that the estimated DP-RTF is biased).
As mentioned above, the bias is brought in by the reflections
in the impulse response segment $a(n)|_{n=0}^{N}$. In addition, if the
DRR gets smaller, a longer CTF is required to cover the room
impulse response. However, for a given T60, the CTF length Q
is set as a constant, for instance 0.25T60. In this example, this
improper value of Q leads to an inaccurate CTF model, which
causes the DP-RTF estimate bias. When the source-to-sensors
distance increases, both the ITDG and DRR become smaller.
Therefore, for both phase and amplitude, the estimation bias
of the second example of Fig. 3 (middle) is larger than the
bias of the first example (left). Moreover, the DP-RTF in
$g_k$ plays a less important role relative to the other elements as
the DRR decreases, which makes the variance of both the phase
and amplitude estimation errors larger than in the first
example. By comparing the first and last examples of Fig. 3, it
is not surprising to observe that the estimation error increases
as noise power increases. When the SNR is low, less reliable
speech frames are available in the high frequency band, due
to the intense noise. Therefore, there is no DP-RTF estimation
for the frequency bins satisfying Pi < 2Qk − 1.

D. Baseline Methods 

In our previous work ca, the proposed inter-frame spectral
subtraction scheme was applied to RTF estimators (as opposed
to the DP-RTF estimators proposed in the present paper). The
results were compared with the RTF estimators proposed in
0 and ED in the presence of WGN or babble noise. The
efficiency of the inter-frame spectral subtraction to remove the
noise was demonstrated. Hence, the present set
of experiments mainly aims at (i) comparing the robustness
to reverberation of the proposed DP-RTF feature with respect
to other features, in a similar SSL framework, and (ii)
comparing the proposed SSL method with a conventional SSL
method.

Fig. 3: The phase (top) and normalized amplitude (bottom) of the normalized estimated DP-RTF (38) as a function of frequency bins.
The source direction is 30°. T60 = 0.5 s. The continuous curve corresponds to the ground-truth DP-RTF dk computed from the anechoic
HRTF. Left: 1 m source-to-sensors distance, 30 dB SNR. Middle: 2 m source-to-sensors distance, 30 dB SNR. Right: 1 m source-to-sensors
distance, 0 dB SNR. For each acoustic condition, the BRIR is convolved with 10 different speech recordings as the sensor signals, whose
DP-RTF estimations are all shown. In this experiment, the noise signal is generated by summing the directional noise and uncorrelated noise
with identical powers.


To this aim, we compare our method with three other
methods: (i) an unbiased RTF identification method (El, in
which a spectral subtraction procedure (similar to the one
described in Section IV-B) is used to suppress noise. Since
this RTF estimator is based on the MTF approximation, we
refer to this method as RTF-MTF. (ii) a method based on
a STFT-domain coherence test (CT) [22].¹ We refer to this
method as RTF-CT. The coherence test is used in [22] to
search for the rank-1 time-frequency bins, which are supposed to
be dominated by one active source. We adopt the coherence
test for single speaker localization, in which one active source
denotes the direct-path source signal. The TF bins that involve
notable reflections have low coherence. We first detect the
maximum coherence over all the frames at each frequency bin,
and then set the coherence test threshold for each frequency
bin to 0.9 times its maximum coherence. In our experiments,
this threshold achieves the best performance. The covariance
matrix is estimated by averaging over 120 ms (15 adjacent
frames). The auto- and cross-PSD spectral subtraction is
applied to the frames that have high speech power and a
coherence larger than the threshold, and the results are then averaged over
frames for RTF estimation. (iii) a classic one-stage algorithm:
the steered-response power (SRP) utilizing the phase transform
(PHAT) 1^ , [40]. The azimuth directions −90° : 5° : 90° are
taken as the steering directions, and their HRIRs are used as
the steering responses.

¹Note that (2T] introduces a similar technique based on interaural
coherence, using features extracted from band-pass filter banks. Also, a binaural
coherent-to-diffuse ratio approach was proposed in (^, (21 and applied to
dereverberation but not to SSL.


Note that for both RTF-MTF and RTF-CT methods, the
features used in the SSL are obtained after the inter-frame
spectral subtraction procedure. The SSL method presented in
Section V is adopted. The training set used as a look-up table
or used for training the regression is the same as for the
DP-RTF.
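The per-frequency threshold used for the RTF-CT baseline (0.9 times the maximum coherence over frames) can be sketched as follows. The coherence values are invented, the additional speech-power criterion mentioned in the text is omitted, and the function name is illustrative.

```python
def select_frames(coherence, factor=0.9):
    """For each frequency bin, keep the frames whose coherence exceeds
    `factor` times the maximum coherence observed in that bin.

    `coherence[k][p]` is the coherence at frequency bin k and frame p.
    """
    selected = []
    for row in coherence:
        threshold = factor * max(row)
        selected.append([p for p, value in enumerate(row) if value > threshold])
    return selected

# Two frequency bins, four frames each (hypothetical values).
coherence = [
    [0.95, 0.50, 0.90, 0.99],   # bin with several highly coherent frames
    [0.40, 0.42, 0.10, 0.30],   # bin where every frame has low coherence
]
frames = select_frames(coherence)
```

In the second bin the relative threshold still selects frames even though none is strongly coherent, which illustrates why a threshold of this kind cannot perfectly isolate direct-path frames.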


E. Localization Results 

Fig. 4 shows the localization results in terms of localization
error (recall that this error is the average absolute
error between the localized directions and their corresponding
ground truth, in degrees, over the complete test dataset). Note
that in the real world, directional noise sources (e.g. fan,
refrigerator) and diffuse background noise co-exist. Hence, in this
experiment, the noise signal was generated by summing the
directional noise and uncorrelated noise with identical powers.

Let us first discuss the localization performance shown
in Fig. 4 (top) for T60 = 0.22 s. When the DRR is high
(1 m source-to-sensors distance; solid lines), RTF-MTF
performs comparably to the proposed method
under high-SNR conditions, and slightly better
under low-SNR conditions (lower than 0 dB). This indicates
that when the reverberation is low, the MTF approximation is
valid. When less reliable data are available (under low-SNR
conditions), the proposed method performs slightly worse than
RTF-MTF due to its greater model complexity. Note that both
the RTF-MTF and the proposed DP-RTF methods achieve very
good localization performance: the localization error goes
from almost 0° at SNR = 10 dB to about 5° at SNR = −10 dB.
RTF-CT achieves the worst performance. This indicates that
when the direct-path impulse response is slightly contaminated
by the reflections, employing all the data (as done by RTF-MTF
and DP-RTF) yields a smaller localization error than
employing only the data selected by the coherence test. In
general, for mild reverberations, the performance gap between
RTF-MTF, RTF-CT and the proposed method is small and the
noise level plays a decisive role for good localization.

The SRP-PHAT method achieves performance comparable
to that of the three other methods when the SNR is
high (10 dB). However, the performance of SRP-PHAT
degrades immediately and dramatically when the SNR
decreases. The steered-response power is severely influenced
by intense noise, especially by the directional noise. This
indicates that the inter-frame spectral subtraction algorithm
applied to RTF-MTF, RTF-CT and the proposed method is
efficient at reducing the noise.

When the DRR decreases (2 m source-to-sensors distance,
grey lines; 3 m source-to-sensors distance, dashed lines), the
performance of RTF-MTF degrades notably. For
SNR = 10 dB, the localization error of RTF-MTF increases
from 0.07° to 1.51° and to 6.35° for source-to-sensors
distances of 1 m, 2 m and 3 m, respectively. The direct-path
impulse response is severely contaminated by the reflections.
At high SNRs, RTF-CT performs slightly better than RTF-MTF.
Indeed, RTF-CT selects the frames that contain fewer
reverberations for calculating the RTF estimate, which improves
the performance at high SNR conditions. However, when the
noise level increases, the precision of RTF-CT also degrades.
The performance of RTF-CT is influenced not only by the
residual noise but also by the decline of the coherence test
precision, which makes it fall even faster than RTF-MTF with
decreasing SNR (it has a larger localization error at −5 dB
and −10 dB).

The proposed method also has a larger localization error
when the source-to-sensors distance increases: the DP-RTF
estimation is possibly influenced by the increased amount of
early reflections in the impulse response segment $a(n)|_{n=0}^{N}$,
by the effect of an improper Q setting, and by the decreased
importance of the DP-RTF in the vector $g_k$. However, the performance
of the proposed DP-RTF method degrades much more slowly than
that of RTF-MTF when the source distance increases.
For an SNR of 10 dB, the localization error of the proposed
method increases from 0.06° to 0.16° and 1.19° as the
source-to-sensors distance increases from 1 m to 2 m and 3 m. It can
be seen that the performance of the proposed method also falls
faster than RTF-MTF with decreasing SNR, since the available
data is less reliable. The localization error of the proposed
method is larger than the RTF-MTF error at −10 dB. It is observed
that the proposed method clearly outperforms RTF-CT.
It is shown in 1^ that the coherence test is influenced by
the coherent reflections (very early reflections) of the source
signal. Moreover, it is difficult to automatically set a coherence
test threshold that could perfectly select the desired frames.
Many frames that have a coherence larger than the threshold
include reflections.

[Fig. 4 plot: localization error versus SNR (dB). Line styles: 1 m, 2 m, 3 m. Markers: Proposed, RTF-MTF, RTF-CT, SRP-PHAT.]

Fig. 4: Localization errors under various reverberation and noise
conditions. Top: T60 = 0.22 s. Middle: T60 = 0.5 s. Bottom: T60 =
0.79 s. The localization errors are shown as a function of SNR for
source-to-sensors distances of 1 m, 2 m and 3 m.

The performance of SRP-PHAT also degrades as the
DRR decreases. It is known that PHAT-based methods are
quite sensitive to reverberations and noise in general. Briefly,
the performance of SRP-PHAT lies in between the
performance of RTF-MTF and that of RTF-CT for high
SNRs, which indicates that the PHAT weight can suppress
the reverberations only to a certain extent. Below 5 dB,
SRP-PHAT performs worst of the four methods.

Fig. 4 (bottom) displays the results for T60 = 0.79 s.
Obviously, the performance of all four methods
degrades as T60 increases. Indeed, the MTF approximation
is not accurate; there are only a few time-frequency bins
with a rank-1 coherence; and a large value of Q has to be
utilized in the proposed method, for which there may not
always be enough reliable data. Here, it can be seen that RTF-CT
performs better than RTF-MTF for any SNR value and
source-to-sensors distance. Even SRP-PHAT performs better
than RTF-MTF (for 2 m and 3 m source-to-sensors distance).
This shows that the RTF estimation error brought by the
MTF approximation largely increases as T60 increases. For
1 m source-to-sensors distance, the proposed method performs
slightly better than the other three methods. For 2 m and
3 m source-to-sensors distance, the proposed method largely
outperforms the other three methods, at all SNRs. For example,
at SNR = 0 dB, the proposed method achieves about 6.5°
of localization error at 2 m source-to-sensors distance, while
RTF-CT (the best of the three baseline methods) achieves
about 15.8°, hence the gain for the proposed method over the
best baseline is about 9.3°. However, the performance of the
proposed method and of RTF-CT still degrades faster
with decreasing SNR than that of RTF-MTF.

Finally, we can see from Fig. 4 (middle) that the
performance of the different methods for T60 = 0.5 s falls in
between the other two cases shown on the same figure, and
the trends of performance evolution with T60 are consistent with
our comments above.

In summary, the proposed method outperforms the three
other methods under most acoustic conditions. In general,
the gain over the baseline methods increases as the
source-to-sensors distance increases (or the DRR decreases)
and as the reverberation time increases (but the influence of
the noise level is more intricate). As a result, the proposed
method achieves acceptable localization performance in quite
adverse conditions. For example (among many others), with
T60 = 0.5 s, source-to-sensors distance of 3 m and an SNR of
0 dB, the localization error is about 9°, and with T60 = 0.79 s,
source-to-sensors distance of 2 m, and an SNR of 0 dB, the
localization error is about 6.5°.

In all the above results, the duration of the signal used
for localization was not considered with great attention: the
localization errors were averaged over 10 TIMIT sentences
of possibly quite different durations, from 1 s to 5 s. Yet
the number of available frames that are used to construct
(24) depends on the speech duration, which is crucial for the
least-squares DP-RTF estimation in (25). Here we complete
the simulation results with a basic test of the influence of
the speech duration on localization performance. To this aim,
we classified our TIMIT test sentences according to their
duration (closer to 1 s, 2 s, 3 s or 4 s) and performed the
localization evaluation for each new group (of 10 sentences),
for a limited set of acoustic conditions (SNR = 10 dB and
0 dB, T60 = 0.5 s). Table IV shows the localization errors
of the proposed method, the RTF-MTF, and the RTF-CT
method, for the four tested approximate speech durations.
We can see that, as expected, all three methods achieve a
smaller localization error when increasing speech duration, for
both tested SNRs. The improvement is more pronounced for
the proposed method and the RTF-CT method compared to
the RTF-MTF method. For example, for SNR = 10 dB, the
localization error is reduced by 66% (from 1.57° to 0.54°)
for the proposed method, and by 49% (from 6.24° to 3.21°)
for the RTF-CT method when the speech duration rises from
1 s to 4 s. In contrast, the localization error of RTF-MTF
is considerably larger and is only reduced by 11% (from 12.60° to
11.16°).

TABLE IV: Localization errors (in degrees) as a function of speech
duration, for T60 = 0.5 s and a source-to-sensors distance of 2 m.

SNR      Method     Speech duration (s)
                    1        2        3        4
10 dB    Proposed   1.57     0.88     0.79     0.54
10 dB    RTF-CT     6.24     4.43     3.86     3.21
10 dB    RTF-MTF    12.60    12.01    11.25    11.16
0 dB     Proposed   7.36     4.62     4.05     3.07
0 dB     RTF-CT     12.97    11.33    10.04    9.67
0 dB     RTF-MTF    17.56    15.29    14.94    15.01
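The relative reductions quoted for SNR = 10 dB can be rechecked from the 1 s and 4 s entries of Table IV. The short sketch below does this arithmetic; the helper name is illustrative.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

# Localization errors (degrees) at SNR = 10 dB for 1 s and 4 s of speech.
errors_10db = {
    "Proposed": (1.57, 0.54),
    "RTF-CT": (6.24, 3.21),
    "RTF-MTF": (12.60, 11.16),
}
reductions = {name: round(relative_reduction(a, b))
              for name, (a, b) in errors_10db.items()}
print(reductions)  # {'Proposed': 66, 'RTF-CT': 49, 'RTF-MTF': 11}
```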


VII. Experiments with the NAO Robot


In this section we present several experiments that were
conducted using the NAO robot (Version 5) in various real-world
environments. NAO is a humanoid companion robot
developed and commercialized by Aldebaran Robotics.² NAO’s
head has four microphones that are nearly coplanar, see Fig. 5.
The recordings contain ego-noise, i.e. noise produced by
the robot. In particular, they contain a loud fan noise, which
is stationary and partially interchannel correlated BTl . The
spectral energy of the fan noise is notable up to 4 kHz; hence
the speech signals are significantly contaminated. Note that
the experiments reported in this section adopt the parameter
settings discussed in Section VI-B.


A. The Datasets 

The data are recorded in three environments: laboratory,
office (e.g., Fig. 6 (right)), and cafeteria, with reverberation
times (T60) that are approximately 0.52 s, 0.47 s and 0.24 s,
respectively. Two test datasets are recorded in these
environments:

1) The audio-only dataset: In the laboratory, speech utterances
from the TIMIT dataset ll^ are emitted by a loudspeaker
in front of NAO. Two groups of data are recorded with a
source-to-robot distance of 1.1 m and 2.1 m, respectively.
For each group, 174 sounds are emitted from directions
uniformly distributed in azimuth and elevation, in the range
[−120°, 120°] (azimuth), and [−15°, 25°] (elevation).

²https://www.ald.softbankrobotics.com.

Fig. 5: NAO’s head has four microphones and one camera.

Fig. 6: The audio-visual training dataset (left) is obtained by moving
a loudspeaker in front of a microphone/camera setup. Sounds are
emitted by a loudspeaker. A LED placed on the loudspeaker makes it
possible to associate each sound direction with an image location (a blue
circle). The data contain pairs of acoustic recordings and sound
directions. A typical localization scenario with the NAO robot (right).

2) The audio-visual dataset: Sounds are emitted by a loudspeaker
lying in the field of view of NAO’s camera. The image
resolution is 640 × 480 pixels, corresponding to an azimuth
range of approximately 60° (−30° to 30°) and an elevation
range of approximately 48° (−24° to 24°), so 1° of azimuth/elevation
corresponds to approximately 10.5 horizontal/vertical pixels.
An LED placed on the loudspeaker makes it possible to estimate the
loudspeaker location in the image, hence ground-truth localization
data are available for the audio-visual dataset. Three sets
of audio-visual data are recorded in three different rooms.
For each set, sounds are emitted from about 230 directions
uniformly distributed in the camera field of view. Fig. 6 (left)
shows the source positions as blue dots in the image
plane. The source-to-robot distance is about 1.5 m in this
dataset.
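
The correspondence between image pixels and sound directions described
for the audio-visual dataset can be sketched as follows. This is only a
minimal illustration: it assumes a purely linear mapping over the image,
an optical axis at the image centre, and sign conventions chosen for the
example; the function names are hypothetical and lens distortion is ignored.

```python
# Sketch: linear pixel <-> direction mapping for a 640 x 480 image spanning
# roughly 60 degrees of azimuth and 48 degrees of elevation.
# ASSUMPTIONS (not from the paper): linear mapping, optical axis at the image
# centre, azimuth increasing to the right, elevation increasing upwards.

WIDTH_PX, HEIGHT_PX = 640, 480
AZIMUTH_SPAN_DEG, ELEVATION_SPAN_DEG = 60.0, 48.0


def pixel_to_direction(u, v):
    """Map pixel (u, v), origin at the top-left corner, to (azimuth, elevation) in degrees."""
    azimuth = (u / WIDTH_PX - 0.5) * AZIMUTH_SPAN_DEG
    elevation = (0.5 - v / HEIGHT_PX) * ELEVATION_SPAN_DEG
    return azimuth, elevation


def direction_to_pixel(azimuth, elevation):
    """Inverse mapping: (azimuth, elevation) in degrees to pixel coordinates."""
    u = (azimuth / AZIMUTH_SPAN_DEG + 0.5) * WIDTH_PX
    v = (0.5 - elevation / ELEVATION_SPAN_DEG) * HEIGHT_PX
    return u, v


# Pixels per degree along each image axis under the linear assumption.
pixels_per_degree_azimuth = WIDTH_PX / AZIMUTH_SPAN_DEG
pixels_per_degree_elevation = HEIGHT_PX / ELEVATION_SPAN_DEG
```

Under these assumptions one degree spans about 10.7 pixels horizontally
and 10 pixels vertically, which is consistent with the approximate figure
of 10.5 pixels per degree quoted above.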


In both datasets, ambient noise is much lower than fan noise, 
hence the noise of recorded signals mainly corresponds to fan 
noise. In the case of the audio-only dataset, the SNR is 14 dB 
and 11 dB for source-to-robot distances of 1.1 m and 2.1 m, 
respectively. For the audio-visual dataset the SNR is 2 dB. 


The training dataset for the audio-only localization experiments
is generated with the NAO head HRIRs of 1,002 directions
uniformly distributed over the same azimuth-elevation
range as the test dataset. The training dataset for the audio-visual
experiments is generated with the NAO head HRIRs of
378 directions uniformly distributed over the camera field of
view. HRIRs are measured in the laboratory: white Gaussian
noise is emitted from each direction, and the cross-correlation
between the microphone and source signals yields the BRIR
of each direction. In order to obtain anechoic HRIRs, the
BRIRs are manually truncated before the first reflection. The
regression method of (6), outlined in Section V, is used
for supervised localization. The SRP-PHAT method takes the
source directions in the training set as the steering directions.
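
The measurement procedure described above (white-noise excitation,
cross-correlation, truncation before the first reflection) can be
illustrated with a toy simulation. The sketch below is not the authors'
measurement code: the 12-tap response, the signal length and the
truncation index are made-up values, and the helper names are
hypothetical. It relies only on the fact that, for a white source, the
cross-correlation between microphone and source is proportional to the
impulse response.

```python
import random

# ASSUMPTIONS (illustrative only): a toy 12-tap "room" response and a
# truncation index picked by hand, mimicking the manual truncation in the text.


def convolve(signal, impulse_response):
    """Plain linear convolution of two real sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def estimate_impulse_response(microphone, source, num_taps):
    """Cross-correlate microphone and source, normalised by the source power.

    For a white source, the cross-correlation at lag k is proportional
    to the k-th tap of the impulse response.
    """
    num_samples = len(source)
    source_power = sum(s * s for s in source) / num_samples
    taps = []
    for lag in range(num_taps):
        acc = sum(microphone[n] * source[n - lag] for n in range(lag, num_samples))
        taps.append(acc / (num_samples * source_power))
    return taps


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(20000)]  # white Gaussian noise

# Toy response: direct path at taps 2-3, first reflection starting at tap 8.
true_response = [0.0, 0.0, 1.0, -0.3, 0.0, 0.0, 0.0, 0.0, 0.6, 0.2, 0.0, 0.0]
microphone = convolve(source, true_response)

estimated_response = estimate_impulse_response(microphone, source, len(true_response))

first_reflection_tap = 8  # chosen by inspection in this toy example
anechoic_response = estimated_response[:first_reflection_tap]
```

In this toy case the truncated response keeps the direct-path taps and
discards the reflection; with real recordings the truncation point has
to be chosen per measurement, which is why the text describes it as manual.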
Fig. 7: Azimuth estimation for the audio-only dataset. Source-to-robot
distance is 1.1 m (top) and 2.1 m (bottom).


B. Localization Results for the Audio-Only Dataset 

Experiments with the audio-only dataset first show that
elevation estimation in the range [−15°, 25°] is unreliable for
all four methods. This can be explained by the fact that
the four microphones are coplanar. Therefore we only present
the azimuth estimation results in the following.

The azimuth estimation results for the audio-only dataset
are given in Fig. 7. The results are quite consistent across
the two conditions, i.e. source-to-robot distances of 1.1 m
(Fig. 7, top) and 2.1 m (Fig. 7, bottom). Globally, for the
azimuth range [−50°, 50°] all four methods provide good
localization, i.e. they follow the ground-truth line quite well,
for both source-to-robot distances. In this range, the proposed
method achieves slightly better results than the RTF-MTF
and RTF-CT methods. The performance of all methods drops
significantly for directions outside this range, but globally, the
proposed method remains the closest to the ground truth. In
more detail, in the approximate ranges [−120°, −50°] and
[50°, 120°], SRP-PHAT and RTF-MTF have
the largest localization errors and many localization outliers
caused by reverberation (SRP-PHAT performs slightly better
than RTF-MTF in the zones just beyond −50° and 50°,
possibly due to the PHAT weighting). By selecting frames that
involve less reverberation, RTF-CT performs slightly better
than RTF-MTF. The proposed method outperforms the others
by extracting the binaural cues associated with the direct-path
propagation. Importantly, at the extremities of the range,
the proposed method generates neither major outliers nor
large deviations from the ground truth, as opposed to the other
methods.

C. Localization Results for the Audio-Visual Dataset 

The azimuth and elevation in the audio-visual dataset are
limited to a small range around 0° azimuth. As a consequence,
both the azimuth and elevation localization results for this
dataset are, on average, better than those for the audio-only
dataset. Table V shows the localization errors for azimuth
(Azim.) and elevation (Elev.) for the audio-visual dataset. The
elevation errors are always larger than the azimuth errors, due
to the low elevation resolution of the microphone array
already mentioned (the microphones are coplanar and the
microphone plane is horizontal). The cafeteria has the smallest
reverberation time, T60 = 0.24 s. Consequently, the RTF-MTF
and RTF-CT methods yield performance measures that
are comparable with those of the proposed method. The office and
laboratory have larger reverberation times, 0.47 s and 0.52 s,
respectively, so the MTF approximation is no longer accurate.
Somewhat surprisingly, RTF-MTF performs better than RTF-CT in
the office (though the errors are quite close); this is probably
because the coherence test does not work well
under low-SNR conditions (recall that the SNR of the
audio-visual dataset is around 2 dB). Globally, SRP-PHAT
performs the worst, due to the intense noise. Because of
the notable reverberation in these two rooms, the proposed method
performs significantly better than the three other methods there.
For example, in the laboratory environment, the proposed
method provides 0.84° azimuth error and 1.84° elevation error,
vs. 1.41° azimuth error and 2.30° elevation error for the
best baseline methods (SRP-PHAT and RTF-MTF,
respectively).
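
The comparison quoted for the laboratory can be checked against Table V
with a few lines of arithmetic. The sketch below simply transcribes the
table; the helper names and the "relative reduction" figure are
introduced here for illustration and are not metrics reported by the authors.

```python
# Localization errors in degrees, transcribed from Table V as
# (azimuth error, elevation error) per room.
TABLE_V = {
    "RTF-MTF": {"Cafeteria": (0.47, 1.58), "Office": (0.62, 2.14), "Laboratory": (1.46, 2.30)},
    "RTF-CT": {"Cafeteria": (0.43, 1.49), "Office": (0.68, 2.30), "Laboratory": (1.59, 2.40)},
    "SRP-PHAT": {"Cafeteria": (0.77, 1.95), "Office": (1.03, 2.80), "Laboratory": (1.41, 3.33)},
    "Proposed": {"Cafeteria": (0.48, 1.46), "Office": (0.55, 1.86), "Laboratory": (0.84, 1.84)},
}

AZIMUTH, ELEVATION = 0, 1


def best_baseline(room, axis):
    """Return (method, error) of the best method other than 'Proposed'."""
    candidates = [
        (method, errors[room][axis])
        for method, errors in TABLE_V.items()
        if method != "Proposed"
    ]
    return min(candidates, key=lambda item: item[1])


def relative_reduction(room, axis):
    """Illustrative figure: fraction by which 'Proposed' lowers the best baseline error."""
    _, baseline_error = best_baseline(room, axis)
    return 1.0 - TABLE_V["Proposed"][room][axis] / baseline_error


for room in ("Cafeteria", "Office", "Laboratory"):
    for axis, name in ((AZIMUTH, "azimuth"), (ELEVATION, "elevation")):
        method, error = best_baseline(room, axis)
        print(f"{room:10s} {name:9s} best baseline: {method} ({error:.2f} deg), "
              f"relative change: {relative_reduction(room, axis):+.0%}")
```

For the laboratory this reproduces the pairing stated in the text
(SRP-PHAT for azimuth, RTF-MTF for elevation). For the cafeteria azimuth
the value is negative, reflecting that RTF-CT is marginally better than
the proposed method in the least reverberant room.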

TABLE V: Localization error (in degrees) for the audio-visual
dataset. The best results are shown in bold.

Method       Cafeteria         Office            Laboratory
             Azim.   Elev.     Azim.   Elev.     Azim.   Elev.
RTF-MTF      0.47    1.58      0.62    2.14      1.46    2.30
RTF-CT       0.43    1.49      0.68    2.30      1.59    2.40
SRP-PHAT     0.77    1.95      1.03    2.80      1.41    3.33
Proposed     0.48    1.46      0.55    1.86      0.84    1.84


VIII. Conclusion

We proposed a method for the estimation of the direct-path
relative transfer function (DP-RTF). Compared with the
conventional RTF, the DP-RTF is defined as the ratio between
two direct-path acoustic transfer functions. Therefore, the
DP-RTF definition and estimation imply the removal of the
reverberations, and the DP-RTF provides a more reliable feature, in particular
for sound source localization. To estimate the DP-RTF, we
adopted the convolutive transfer function (CTF) model instead
of the multiplicative transfer function (MTF) approximation.
By doing this, the DP-RTF can be estimated by solving a set
of linear equations constructed from the reverberant sensor
signals. Moreover, an inter-frame spectral subtraction method
was proposed to remove noise power. This spectral subtraction
process does not require explicit estimation of the noise PSD,
hence it does not suffer from noise PSD estimation errors.
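
As a reminder of why the CTF model leads to linear equations and why
differencing PSDs across frames removes stationary noise, the argument
can be sketched as follows. The notation is assumed for this sketch and
may differ from that of the earlier sections: x_p and y_p denote the
STFT coefficients of the two channels at frame p and a fixed frequency,
s_p the source, a_q and b_q the CTF coefficients of the two channels
over Q taps, and b_0 / a_0 the DP-RTF. The noise is assumed stationary
and uncorrelated with the speech.

```latex
% CTF model for the two channels (noise-free case)
\begin{align}
x_p &= \sum_{q=0}^{Q-1} a_q\, s_{p-q}, &
y_p &= \sum_{q=0}^{Q-1} b_q\, s_{p-q}.
\end{align}
% Cross-relation: convolving each channel with the other channel's CTF
\begin{equation}
\sum_{q=0}^{Q-1} b_q\, x_{p-q} \;=\; \sum_{q=0}^{Q-1} a_q\, y_{p-q}.
\end{equation}
% Dividing by a_0 gives an equation that is linear in the unknowns
\begin{equation}
y_p \;=\; \frac{b_0}{a_0}\, x_p
      \;+\; \sum_{q=1}^{Q-1} \frac{b_q}{a_0}\, x_{p-q}
      \;-\; \sum_{q=1}^{Q-1} \frac{a_q}{a_0}\, y_{p-q}
      \;=\; \mathbf{z}_p^{\top} \mathbf{g},
\end{equation}
% with z_p = [x_p, ..., x_{p-Q+1}, y_{p-1}, ..., y_{p-Q+1}]^T and the DP-RTF
% as the first entry of g. Multiplying by y_p^* and taking expectations:
\begin{equation}
\phi_{yy}(p) \;=\; \boldsymbol{\phi}_{zy}(p)^{\top} \mathbf{g}.
\end{equation}
% With additive stationary noise, each noisy PSD is the clean PSD plus a
% frame-independent noise term, so the noise cancels in a frame difference:
\begin{align}
\tilde{\phi}_{yy}(p) &= \phi_{yy}(p) + \phi_{yy}^{\text{noise}}, &
\tilde{\boldsymbol{\phi}}_{zy}(p) &= \boldsymbol{\phi}_{zy}(p) + \boldsymbol{\phi}_{zy}^{\text{noise}}, \\
\tilde{\phi}_{yy}(p_1) - \tilde{\phi}_{yy}(p_2)
  &= \bigl[\tilde{\boldsymbol{\phi}}_{zy}(p_1) - \tilde{\boldsymbol{\phi}}_{zy}(p_2)\bigr]^{\top} \mathbf{g}.
\end{align}
```

The last line involves only measured (noisy) quantities, which is why no
explicit estimate of the noise PSD is needed under these assumptions.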

Based on the DP-RTF we proposed a supervised sound-source
localization algorithm. The latter relies on a training
dataset that is composed of pairs of DP-RTF feature
vectors and their associated sound directions. The training
dataset is pre-processed in such a way that it only contains
anechoic head-related impulse responses. Hence the training
dataset does not depend on the particular acoustic properties
of the recording environment. Only the sensor set-up
must be consistent between training and testing (e.g. using
the same dummy/robot head). In practice we implemented
two supervised methods, namely a nearest-neighbor search
and a mixture of linear regressions. Experiments with both
simulated data and real data recorded with four microphones
embedded in a robot head showed that the proposed method
outperforms an MTF-based method and a method based on a
coherence test, as well as a conventional SRP-PHAT method,
in reverberant environments.
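
The simpler of the two supervised methods mentioned above, the
nearest-neighbor search over (feature vector, direction) pairs, can be
illustrated with a toy example. Everything specific in this sketch is
an assumption: the squared Euclidean distance, the two-dimensional
complex feature vectors and the three training directions are invented
for illustration and are not taken from the experiments.

```python
# Toy nearest-neighbour look-up over (feature vector, direction) pairs.
# ASSUMPTIONS: features are short tuples of complex numbers standing in for
# DP-RTF vectors; the distance is squared Euclidean; all values are made up.


def squared_distance(a, b):
    """Squared Euclidean distance between two complex-valued vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_direction(feature, training_pairs):
    """Return the direction paired with the training feature closest to `feature`."""
    _, direction = min(
        training_pairs, key=lambda pair: squared_distance(feature, pair[0])
    )
    return direction


# Hypothetical training set: (feature vector, (azimuth, elevation) in degrees).
training_pairs = [
    ((1.0 + 0.0j, 0.9 - 0.1j), (-30.0, 0.0)),
    ((1.0 + 0.5j, 0.8 + 0.4j), (0.0, 0.0)),
    ((0.6 + 0.9j, 0.2 + 0.9j), (30.0, 0.0)),
]

observed_feature = (0.95 + 0.45j, 0.85 + 0.35j)
estimated_direction = nearest_neighbor_direction(observed_feature, training_pairs)
```

A look-up of this kind can only return directions present in the
training set, which is one reason a regression-based method is also
considered.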

In the presented experiments the model parameters Q, D
and N (Section VI-B) were set to constant values which were
chosen as a tradeoff yielding good results in a variety of
acoustic conditions. In the future, to improve the robustness
of DP-RTF, we plan to estimate the acoustic conditions using
the microphone signals, such that an optimal set of parameters
can be adaptively adjusted. We also plan to extend the
DP-RTF estimator and its use in SSL to the more complex case
of multiple sound sources.